This paper surveys graphical tools developed in the past three decades that are applicable to
linear structural equation models (SEMs). These tools permit researchers to answer key research
questions by simple path-tracing rules, even for highly complex models. They include
parameter identification, causal effect identification, regressor selection, selecting instrumental
variables, finding testable implications of a given model, identifying equivalent models and
estimating counterfactual relationships.
Introduction

Recent advances in graphical models have had a transformative impact on causal analysis
and machine learning. Among the tasks facilitated by graphical models are: model testing,
identification, policy analysis, bias control, mediation, external validity, and the analysis of
counterfactuals and missing data (Pearl, 2014a). Only a meager portion of these developments
has found its way to the mainstream structural equation modeling (SEM) literature which,
by and large, prefers algebraic over graphical representations (Jöreskog and Sörbom, 1982;
Bollen, 1989; Mulaik, 2009; Hoyle, 2012). The reason for this disparity rests primarily
in the fact that graphical techniques were developed for non-parametric analysis, while most
SEM research is conducted within the confines of Gaussian linear models, to which matrix
algebra and powerful statistical tests are applicable.

The purpose of this paper is to introduce SEM researchers to modern tools of graphical
models and to describe the benefits, as well as the new insights, that graphical models can
provide. These include new methods of testing models, new ways of identifying structural
parameters, new techniques of reducing confounding bias, and new paradigms for handling
missing data.

To make this paper self-contained, we will start with the basic definitions of regression
analysis, linear structural equation models, path analysis, causal effects, and Wright’s
path tracing rules. We then introduce more advanced notions of graph separation, which were
developed for non-parametric analysis, but have a simple and meaningful interpretation in
linear models. These tools provide the basis for model testing and identification criteria,
discussed in subsequent sections. We then cover advanced applications of path diagrams
including equivalent regressor sets, minimal regressor sets, and variance minimizing for
causal effect estimation. Lastly, we discuss counterfactuals and their computation in linear
SEMs before showing how the tools presented in this paper provide simple solutions to five
examples representing non-trivial problems in SEM research.

Preliminaries 

Expected Value, Covariance, Regression, and Correlation

The expected value of a variable X, denoted $E[X]$, is defined as

$$E[X] = \int x \, P(x) \, dx \tag{1}$$

and can be interpreted as the “weighted average” of X, where
$P(x)$ stands for the probability density function of X.

The variance of X is defined as

$$\sigma_X^2 = E\big[(X - E[X])^2\big] \tag{2}$$

and measures the degree to which X deviates from its mean $E[X]$.

The covariance of X and Y is defined as

$$\sigma_{XY} = E\big[(X - E[X])(Y - E[Y])\big] \tag{3}$$

and measures the degree to which X and Y vary together.

The covariance matrix of a set of variables, $\{X_1, X_2, \ldots, X_n\}$, is the matrix,
$[\sigma_{X_i X_j}]$, containing the covariance between each pair of variables and the
variance of each variable along the diagonal.

The correlation coefficient of X and Y, $\rho_{XY}$, and the regression coefficient,
$\beta_{YX}$, of Y on X are two other measures of association, which we define in terms
of the covariance:

$$\rho_{XY} = \frac{\sigma_{XY}}{\sigma_X \sigma_Y} \tag{4}$$

$$\beta_{YX} = \frac{\sigma_{XY}}{\sigma_X^2} = \frac{\sigma_Y}{\sigma_X} \, \rho_{XY} \tag{5}$$


$\rho_{XY}$ is a normalized measure of association confined to the interval
$-1 \le \rho_{XY} \le 1$ (equivalently, $0 \le \rho_{XY}^2 \le 1$). If the distribution is
Gaussian (assumed for the remainder of the paper), the regression coefficient, $\beta_{YX}$,
represents the slope of the least squares line in the prediction of Y given X:
$\beta_{YX} = \frac{\partial}{\partial x} E[Y \mid X = x]$ for all x. Finally, notice that if
the variables have been standardized (also assumed, without loss of generality, for
the remainder of the paper) so that the mean and variance of each variable are equal to
0 and 1 respectively, we have $\sigma_{XY} = \rho_{XY} = \beta_{XY} = \beta_{YX}$.
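
The relations in Equations 4 and 5 can be checked numerically. The sketch below uses simulated data; the seed, sample size, and the slope 0.6 are arbitrary illustrative choices and are not taken from the paper. It compares the covariance-based formulas with a least-squares fit and confirms that, after standardization, covariance, correlation, and regression slope coincide.

```python
import numpy as np

rng = np.random.default_rng(0)            # arbitrary seed (assumption)
x = rng.normal(size=10_000)
y = 0.6 * x + rng.normal(size=10_000)     # hypothetical linear relationship


def cov(u, v):
    """Sample covariance (population normalization, matching np.std)."""
    return np.mean((u - u.mean()) * (v - v.mean()))


def standardize(v):
    """Rescale to mean 0 and variance 1."""
    return (v - v.mean()) / v.std()


rho = cov(x, y) / (x.std() * y.std())     # Equation 4
beta_yx = cov(x, y) / cov(x, x)           # Equation 5, first form
slope = np.polyfit(x, y, 1)[0]            # least-squares slope of y on x

xs, ys = standardize(x), standardize(y)
sigma_std = cov(xs, ys)                   # covariance of standardized variables
beta_std = cov(xs, ys) / cov(xs, xs)      # regression slope of standardized variables
```

In this sketch `beta_yx` agrees with `slope`, and `sigma_std`, `beta_std`, and `rho` agree with one another, as the text states for standardized variables.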


Conditional  Expectation,  Partial  Covariance,  Partial 
Correlation,  and  Partial  Regression 

The conditional expectation of Y given X = x, denoted $E[Y \mid X = x]$, is the expected
value of Y when X = x. More formally,

$$E[Y \mid X = x] = \int y \, P(y \mid X = x) \, dy \tag{6}$$

We will also utilize partial covariances, correlations, and regressions, which measure the
association between X and Y conditioned on a given set of variables, Z. For example,
the partial regression coefficient of Y on X given Z = z, or z-specific regression
coefficient of Y on X, is given by:

$$\beta_{YX.z} = \frac{\partial}{\partial x} E[Y \mid X = x, Z = z] \tag{7}$$

In words, $\beta_{YX.z}$ is the slope of the regression line of Y on X when we consider only
cases for which Z = z. Since we are assuming a Gaussian distribution, $\beta_{YX.z}$ does not
change for different values of z, which allows us to write $\beta_{YX.z} = \beta_{YX.Z}$,
where $\beta_{YX.Z}$ is the coefficient for X when we regress Y on X and Z.

The partial correlation coefficient, $\rho_{XY.Z}$, can be defined by normalizing
$\beta_{YX.Z}$:

$$\rho_{XY.Z} = \beta_{YX.Z} \, \frac{\sigma_{X.Z}}{\sigma_{Y.Z}} \tag{8}$$


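A well known result in regression analysis (Cramér, 1946) permits us to express
$\rho_{YX.Z}$, $\sigma_{YX.Z}$, or $\beta_{YX.Z}$ recursively in terms of pair-wise
correlation coefficients. When Z is a singleton, this reduction reads:

$$\rho_{YX.Z} = \frac{\rho_{YX} - \rho_{YZ}\,\rho_{XZ}}{\left[(1 - \rho_{YZ}^2)(1 - \rho_{XZ}^2)\right]^{1/2}} \tag{9}$$

$$\sigma_{YX.Z} = \sigma_{YX} - \frac{\sigma_{YZ}\,\sigma_{ZX}}{\sigma_Z^2} \tag{10}$$

$$\beta_{YX.Z} = \frac{\sigma_Y}{\sigma_X} \, \frac{\rho_{YX} - \rho_{YZ}\,\rho_{ZX}}{1 - \rho_{XZ}^2} \tag{11}$$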


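If we wish to reduce $\rho_{YX.ZS}$, $\sigma_{YX.ZS}$, or $\beta_{YX.ZS}$ when Z is a
singleton and S a set containing one or more variables, it can be done as follows:

$$\rho_{YX.ZS} = \frac{\rho_{YX.S} - \rho_{YZ.S}\,\rho_{XZ.S}}{\left[(1 - \rho_{YZ.S}^2)(1 - \rho_{XZ.S}^2)\right]^{1/2}} \tag{12}$$

$$\sigma_{YX.ZS} = \sigma_{YX.S} - \frac{\sigma_{YZ.S}\,\sigma_{ZX.S}}{\sigma_{Z.S}^2} \tag{13}$$

$$\beta_{YX.ZS} = \frac{\sigma_{Y.S}}{\sigma_{X.S}} \, \frac{\rho_{YX.S} - \rho_{YZ.S}\,\rho_{ZX.S}}{1 - \rho_{XZ.S}^2} \tag{14}$$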


We see that $\rho_{YX.ZS}$, $\sigma_{YX.ZS}$, or $\beta_{YX.ZS}$ can be expressed in
terms of pair-wise coefficients by recursively applying the above formulas for each element
of S. When the conditioning set becomes large, this procedure can yield rather complicated
expressions. However, if our aim is merely to identify vanishing partial correlations, which
is the case in many applications, we can be spared the effort entailed by this recursion
and use graphical criteria instead. These are reviewed in the subsection on d-separation.
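
The recursion of Equations 9 and 12 is straightforward to mechanize. The following sketch is illustrative only: the function name `partial_corr`, the index-based interface, and the simulated correlation matrix are assumptions made for this example, not part of the paper. It peels off one conditioning variable at a time, exactly as Equation 12 prescribes.

```python
import numpy as np


def partial_corr(R, x, y, cond=()):
    """Partial correlation rho_{YX.cond} from a correlation matrix R.

    x, y are variable indices; cond is a sequence of conditioning indices.
    Implements the recursion of Equations 9 and 12: remove one variable Z
    from the conditioning set and recurse on the remainder S.
    """
    cond = tuple(cond)
    if not cond:
        return R[x, y]                       # base case: pair-wise correlation
    z, s = cond[0], cond[1:]
    r_yx = partial_corr(R, x, y, s)
    r_yz = partial_corr(R, y, z, s)
    r_xz = partial_corr(R, x, z, s)
    return (r_yx - r_yz * r_xz) / np.sqrt((1 - r_yz**2) * (1 - r_xz**2))


# Hypothetical data: four correlated Gaussian variables.
rng = np.random.default_rng(1)
data = rng.normal(size=(2000, 4)) @ rng.normal(size=(4, 4))
R = np.corrcoef(data, rowvar=False)

r = partial_corr(R, 0, 1, (2, 3))            # rho_{X0 X1 . X2 X3}
```

Each additional conditioning variable triples the number of recursive calls, which is one concrete sense in which the expressions become complicated as the conditioning set grows.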


Linear  Structural  Equation  Models 

Structural equation models (SEMs) use mathematical functions to describe the data
generating mechanism for a set of variables. For example, the structural equation
$Y = aX + U_Y$ describes a physical process by which Nature examines the values of X and
$U_Y$ and assigns the value $aX + U_Y$ to variable Y.^1 If the model specification
accurately reflects the data generating mechanism, it is capable of answering all causally
related queries regarding the model variables, including questions of prospective and
introspective counterfactuals.^2 In this paper, we focus on linear SEMs.

Consider a set of observed variables, $y_1, y_2, \ldots, y_n$. A linear SEM consists of a
set of equations of the form,

$$y_i = \Lambda_{1i} y_1 + \Lambda_{2i} y_2 + \ldots + \Lambda_{ni} y_n + u_i$$

where $y_1, y_2, \ldots, y_n$ are the model variables, $\Lambda_{ji}$ is a coefficient
that conveys the strength of the causal relationship from $y_j$ to $y_i$, and $u_i$ is a
random error term due to latent or omitted factors that is generally assumed to be normally
distributed. If $\Lambda_{ji} = 0$, indicating zero direct influence of $y_j$ on $y_i$, then
we usually omit the term $\Lambda_{ji} y_j$ from the equation. Note that the notion of
direct effect depends on the set of variables we decide to include in the model, with the
understanding that every coefficient $\Lambda_{ji}$ from $y_j$ to $y_i$ may summarize a
chain of many microprocesses whose precise nature remains outside the model. Throughout the
paper we often use distinct letters (e.g. a, b, c) in place of $\Lambda_{ji}$ for the
coefficients. For example, in Model 1 shown below, the coefficient $\Lambda_{CS}$ is given
the label b while coefficient $\Lambda_{Q_2 S}$ is given the label e.

^1 The equal sign in structural equations is not symmetric. $Y = aX + U_Y$ does not imply
the structural equation $X = (Y - U_Y)/a$ because X may be assigned its value according to
other variables in the model, not Y and $U_Y$.

^2 Prospective counterfactual queries are queries of the form, “What value would Y take if
X were set to x?” Introspective counterfactual queries are queries of the form, “Given that
Y currently takes the value of y, what would have been the value of Y if X had been x?”
Counterfactuals will be discussed in more detail in the section on counterfactuals.

Figure 1. (a) Model with latent variables ($Q_1$ and $Q_2$) shown explicitly (b) Same model
with latent variables summarized

Figure 2. Diagrams associated with Model 2 in the text (a) with latent variables shown
explicitly (b) with latent variables summarized

Typically, the modeler specifies the equations from domain knowledge and attempts to
estimate the coefficients from data. For example, suppose we wish to estimate the effect of
attending an elite college on future earnings (Figure 1a). Clearly, simply regressing
earnings on college rating will not give an unbiased estimate of the target effect. Since
elite colleges are highly selective, students attending them are likely to have
qualifications for high-earning jobs prior to attending the school. This background
knowledge is expressed in the following model specification:

Model 1.

$$
\begin{aligned}
Q_1 &= U_1 \\
C &= a \cdot Q_1 + U_2 \\
Q_2 &= c \cdot C + d \cdot Q_1 + U_3 \\
S &= b \cdot C + e \cdot Q_2 + U_4
\end{aligned}
$$

where $Q_1$ represents the individual’s qualifications prior to college, $Q_2$ represents
qualifications after college, C contains attributes representing the quality of the college
attended, and S the individual’s salary. Both $Q_1$ and $Q_2$ may be latent variables,
meaning they are unobserved and, therefore, not present in the dataset. The path diagram for
this model is depicted in Figure 1a.

In order to estimate the values of the coefficients b, c, and e, which convey the causal
effect of attending an elite college on future earnings, the coefficients must have a unique
solution in terms of the covariance matrix, $[\sigma_{ij}]$. The task of finding this
solution is known as identification and is discussed in a later section. In some cases, one
or more coefficients may not be identifiable, meaning that no matter the size of the dataset,
it is impossible to obtain an unbiased estimate of these values. Indeed, we will see that
the coefficients in Model 1 are not identified if $Q_1$ and $Q_2$ are latent.

However, if we include the strength of an individual’s college application, A, in the
model, as shown in Figure 2a, we obtain the following structural equations:

Model 2.

$$
\begin{aligned}
Q_1 &= U_1 \\
A &= a \cdot Q_1 + U_2 \\
C &= b \cdot A + U_3 \\
Q_2 &= e \cdot Q_1 + d \cdot C + U_4 \\
S &= c \cdot C + f \cdot Q_2 + U_5,
\end{aligned}
$$

from which the causal effect of attending an elite college on future salary is
identifiable, as we will show.

The ability to determine identifiability directly from the model specification is a
valuable feature of graphical models. For example, it would be a waste of resources to
specify the structure in Model 1 and gather data only to find that the parameter of interest
is not identified. The tools provided in subsequent sections will allow modelers to
determine immediately from the path diagram that the effect of attending an elite college on
future salary, $c + df$, is not identified using Model 1 but is identified using Model 2.
Further, we will also be able to determine, again by inspection, that $c + df$ equals
$\beta_{SC.A}$, and that the regression coefficient $\beta_{SA.CQ_1}$ vanishes, which can be
used to test whether the specification of Model 2 is compatible with the data. Most
importantly, these tools will be applicable to far more complex models where questions of
identifiability and testable implications are nearly impossible to determine by hand or even
by standard software.
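
The two claims about Model 2 can be illustrated by simulation. The sketch below rests on several assumptions that are not in the paper: the coefficient values, the unit-variance error terms (so the variables are not standardized), the sample size, and the helper name `ols_slopes`. It also assumes that college quality C is generated from the application strength A alone, $C = b \cdot A + U_3$, with independent errors. Under those assumptions, regressing S on C alone is biased, regressing S on C and A recovers $c + df$, and the coefficient of A in the regression of S on A, C, and $Q_1$ is approximately zero.

```python
import numpy as np

rng = np.random.default_rng(0)               # arbitrary seed (assumption)
n = 200_000
a, b, c, d, e, f = 0.8, 0.7, 0.3, 0.5, 0.6, 0.4   # hypothetical coefficients

# Model 2, with independent unit-variance errors (assumption).
Q1 = rng.normal(size=n)
A = a * Q1 + rng.normal(size=n)
C = b * A + rng.normal(size=n)
Q2 = e * Q1 + d * C + rng.normal(size=n)
S = c * C + f * Q2 + rng.normal(size=n)


def ols_slopes(y, *regressors):
    """Least-squares slopes of y on the given regressors (intercept dropped)."""
    X = np.column_stack(regressors + (np.ones_like(y),))
    coef, *_ = np.linalg.lstsq(X, y, rcond=None)
    return coef[:-1]


true_effect = c + d * f                      # total effect of C on S
naive = ols_slopes(S, C)[0]                  # beta_{SC}: confounded by Q1
adjusted = ols_slopes(S, C, A)[0]            # beta_{SC.A}
vanishing = ols_slopes(S, A, C, Q1)[0]       # beta_{SA.CQ1}
```

In this simulation `Q1` is available because the data are generated by the script; with real data in which $Q_1$ is latent, only `adjusted` could be computed.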

Path  Diagrams  and  Graphs 

Path diagrams were introduced by Sewall Wright (1921), who aimed to estimate causal
influences from statistical data on animal breeding. Today, SEM is generally implemented
in software^3, and, as a result, when users experience unexpected behavior (due to
unidentified parameters, for example) they are often at a loss as to how to proceed^4. For
the remainder of this section, we will review the basics of path diagrams and provide users
with simple, intuitive tools to resolve questions of identification, goodness of fit, and
more using graphical methods.

The path diagram or causal graph^5 of an SEM can be easily drawn from the equations in the
model. The vertices represent model variables, and for each equation,

$$y_i = \Lambda_{1i} y_1 + \Lambda_{2i} y_2 + \ldots + \Lambda_{ni} y_n + u_i,$$

arrows are drawn from the variables $y_j$ to $y_i$ whenever $\Lambda_{ji} \neq 0$. Each
arrow, therefore, is associated with a coefficient in the SEM, which we will label as its
path coefficient. The error terms, $u_i$, are usually not represented in the graph.
However, a bidirected arc between two variables, $y_i$ and $y_j$, indicates that their
corresponding error terms, $u_i$ and $u_j$, may be statistically dependent while the lack of
a bidirected arc indicates that the error terms are independent. An edge is defined to be
either an arrow or a bidirected arc. Figure 1a is a path diagram for Model 1 while Figure 2a
is a path diagram for Model 2.

If an arrow, called (X, Y), exists from X to Y we say that X is a parent of Y. If there
exists a sequence of arrows all of which are directed from X to Y we say that X is an
ancestor of Y. If X is an ancestor of Y then Y is a descendant of X. Finally, the nodes
connected to Y by a bidirected arc are called the siblings of Y.

A path between X and Y is a sequence of edges connecting the two vertices. A path may go
either along or against the direction of the arrows. A directed path from X to Y is a path
consisting only of arrows pointed towards Y.

A graph is acyclic if it does not contain any cycle, that is, a directed path that begins
and ends with the same node. A graph is cyclic if it contains a cycle. A model in which the
causal graph is acyclic is called recursive while models with cyclic graphs are called
non-recursive.
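
These definitions translate directly into code. The sketch below represents a graph as a list of (parent, child) arrows; the representation and function names are choices made for this illustration, and bidirected arcs are not modeled. The example arrows are those implied by the equations of Model 3 later in the paper (Z into X and Y, X into W and Y, W into Y).

```python
def parents(node, arrows):
    """Nodes with an arrow pointing directly into `node`."""
    return {a for a, b in arrows if b == node}


def descendants(node, arrows):
    """Nodes reachable from `node` along a directed path of one or more arrows."""
    children = {}
    for a, b in arrows:
        children.setdefault(a, set()).add(b)
    found, stack = set(), [node]
    while stack:
        v = stack.pop()
        for w in children.get(v, ()):
            if w not in found:
                found.add(w)
                stack.append(w)
    return found


def ancestors(node, arrows):
    """Nodes from which `node` is reachable: descendants in the reversed graph."""
    return descendants(node, [(b, a) for a, b in arrows])


def is_acyclic(arrows):
    """A graph is acyclic (the model is recursive) if no node is its own descendant."""
    nodes = {v for arrow in arrows for v in arrow}
    return all(v not in descendants(v, arrows) for v in nodes)


model3 = [("Z", "X"), ("X", "W"), ("Z", "Y"), ("W", "Y"), ("X", "Y")]
```

For instance, `parents("Y", model3)` returns the three variables on the right-hand side of the equation for Y, and adding a hypothetical arrow from Y back to Z would make the graph cyclic.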

For conciseness and clarity, latent variables will not be included in the model
specification or depicted as nodes in the graph. Instead, their effect will be summarized by
the correlation they induce on error variables, as represented in the diagram by bidirected
arcs. It was shown by Verma (1993) that one can always summarize a sequence of
interconnected latent variables in the form of pairwise correlations. For example, the
effects of the latent variables in Figures 1a and 2a are summarized by Figures 1b and 2b
respectively. We see that the effect of College on Salary in Figure 1a is now summarized by
the coefficient a in Figure 1b. Similarly, the bidirected arc between C and S (representing
the correlation of the error terms of C and S) in Figure 1b summarizes the correlation
between C and S due to the path $C \leftarrow Q_1 \rightarrow Q_2 \rightarrow S$.

Causal  Effects 

Let $\Pi = \{\pi_1, \pi_2, \ldots, \pi_k\}$ be the set of directed paths from X to Y and
$p_i$ be the product of the path coefficients along path $\pi_i$. The total effect of X on Y
is often defined as $\sum_i p_i$ (Bollen, 1989). For example, in Figure 3a, the total effect
of X on Y is $a \cdot b + e$.

The rationale for this additive formula and its extension to non-linear systems can best be
seen if we define the causal effect of X on Y as the expected change in Y when X is assigned
to different values by intervention, as in a randomized experiment. The act of assigning a
variable X to the value x is represented by removing the structural equation for X and
replacing it with the equality X = x. This replacement dislodges X from its prior causes and
ensures that covariation between X and Y reflects causal paths from X to Y only.

^3 Common software packages include AMOS (Arbuckle, 2005), EQS (Bentler, 1989), LISREL
(Jöreskog and Sörbom, 1989), and MPlus (Muthén and Muthén, 2010) among others.

^4 Kenny and Milan (2011) write, “Identification is perhaps the most difficult concept for
SEM researchers to understand. We have seen SEM experts baffled and bewildered by issues of
identification.”

^5 We use both terms interchangeably.

Figure 3. Models depicting interventions (a) Original model (b) After intervening on X
(c) After intervening on X, Z, and

The expected value of a variable, Y, after X is assigned the value x by intervention is
denoted $E[Y \mid do(X = x)]$, and the causal effect of X on Y is defined as

$$E[Y \mid do(X = x + 1)] - E[Y \mid do(X = x)], \tag{15}$$

where x is some reference point. In non-linear systems, the effect will depend on the
reference point but in the linear case, x will play no role and we can replace (15) with
the derivative,

$$\frac{\partial}{\partial x} E[Y \mid do(X = x)]. \tag{16}$$

Consider again Models 1 and 2 with C a binary variable taking value 1 for elite colleges and
0 for non-elite colleges. For defining the total effect of attending an elite college on
salary, we would hypothetically assign each member of the population to an elite college and
observe the average salary, $E[S \mid do(C = 1)]$. Then we would rewind time and assign
each member to a non-elite college, observing the new average salary,
$E[S \mid do(C = 0)]$. Intuitively, the causal effect of attending an elite college is the
difference in average salary,

$$E[S \mid do(C = 1)] - E[S \mid do(C = 0)].$$

The above operation provides a mathematical procedure that mimics this hypothetical (and
impossible) experiment using a SEM.

In linear systems, this “interventional” definition of causal effect coincides with the
aforementioned “path-tracing” definition, as can be seen using Figure 3a and its
corresponding structural equations:

Model 3.

$$
\begin{aligned}
Z &= U_Z \\
X &= cZ + U_X \\
W &= aX + U_W \\
Y &= dZ + bW + eX + U_Y
\end{aligned}
$$

Using the do operation we obtain the new set of equations:

$$
\begin{aligned}
Z &= U_Z \\
X &= x \\
W &= aX + U_W \\
Y &= dZ + bW + eX + U_Y
\end{aligned}
$$

The corresponding path diagram is displayed in Figure 3b. (Notice that paths between X and Y
due to common causes, Z, have been cut, and as a result, all paths between X and Y now
reflect the causal effect from X to Y only.)

Recalling that we assume model variables have been standardized to mean 0 and variance 1,
so that $E[U_i] = 0$ for all i, we see that setting X to x gives the following expectation
for Y:

$$
\begin{aligned}
E[Y \mid do(X = x)] &= E[d \cdot Z + b \cdot W + e \cdot X + U_Y] \\
&= d \cdot E[Z] + b \cdot E[W] + e \cdot x + E[U_Y] \\
&= d \cdot E[U_Z] + b \cdot E[a \cdot x + U_W] + e \cdot x + E[U_Y] \\
&= d \cdot 0 + b \cdot a \cdot x + b \cdot 0 + e \cdot x + 0 \\
&= b \cdot a \cdot x + 0 + e \cdot x + 0 \\
&= b \cdot a \cdot x + e \cdot x.
\end{aligned}
$$

As a result,

$$E[Y \mid do(X = x + 1)] - E[Y \mid do(X = x)] = ab + e \tag{17}$$

for all x, aligning the two definitions. Note, moreover, that this equality holds even when
the coefficients, a, b, and e, are not identified (e.g. if the U terms are correlated).
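
The derivation can be mirrored in simulation by literally replacing the equation for X with a constant. The sketch below makes several assumptions for illustration: the coefficient values, unit-variance independent errors (so variables are not standardized), the sample size, and the function name `simulate_model3`. It estimates the interventional contrast of Equation 17 and compares it with the slope obtained by regressing Y on X in observational data, which is confounded by Z.

```python
import numpy as np

a, b, c, d, e = 0.5, 0.4, 0.6, 0.3, 0.2      # hypothetical coefficients


def simulate_model3(n, rng, do_x=None):
    """Draw n samples from Model 3; if do_x is given, replace X's equation by X = do_x."""
    Z = rng.normal(size=n)
    if do_x is None:
        X = c * Z + rng.normal(size=n)       # observational regime
    else:
        X = np.full(n, float(do_x))          # intervention do(X = do_x)
    W = a * X + rng.normal(size=n)
    Y = d * Z + b * W + e * X + rng.normal(size=n)
    return Z, X, W, Y


rng = np.random.default_rng(0)               # arbitrary seed (assumption)
n = 400_000

y_do1 = simulate_model3(n, rng, do_x=1.0)[3]
y_do0 = simulate_model3(n, rng, do_x=0.0)[3]
effect = y_do1.mean() - y_do0.mean()         # estimate of Equation 17

Z, X, W, Y = simulate_model3(n, rng)
naive_slope = np.cov(X, Y)[0, 1] / np.var(X, ddof=1)   # observational beta_{YX}
```

With these values `effect` is close to $ab + e = 0.4$, while `naive_slope` is noticeably larger because the path through the common cause Z has not been cut.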

In many cases, we may be interested in the direct effect of X on Y. The term “direct
effect” is meant to quantify an effect that is not mediated by other variables in the model
or, more accurately, the sensitivity of Y to changes in X while all other factors in the
analysis are held fixed (Pearl, 2000).

Figure 4. Models illustrating Wright’s rules of path tracing

“Holding all other factors fixed” can be simulated by assigning all variables other than X
and Y by intervention to an arbitrary set of reference values^6. (Like the total effect, in
linear systems, the direct effect does not change with respect to the reference values.)
Doing so severs all causal links in the model other than those leading into Y. As a result,
all links from X to Y other than the direct link will be severed. For example, Figure 3c
shows the path diagram of Model 3 after intervention on all variables other than X and Y.

Now, the direct effect of X on Y can be defined as

$$E[Y \mid do(X = x + 1, S = s)] - E[Y \mid do(X = x, S = s)],$$

where S is a set containing all model variables other than X and Y and $\{x \cup s\}$ a set
of reference values. It is not hard to show that, in linear models, the direct effect is
equal to the path coefficient from X to Y. This provides, in fact, a formal interventional
justification for associating the path coefficient with the notion of “direct effect”, and
permits us to extend this notion to non-linear models.

Finally, in linear models, the effect of X on Y mediated by W is equal to the sum of the
products of coefficients associated with directed paths from X to Y that go through W. In
Figure 3a, we see that this effect is equal to ab. For a non-linear and non-parametric
extension of this definition, see indirect effect in Pearl (2000).

Wright’s  Path  Tracing  Rules 

The earliest usage of graphs in causal analysis can be found in Sewall Wright’s 1921 paper,
“Correlation and Causation”. This seminal paper gives a method by which the covariance
$\sigma_{YX}$ of any two variables in a recursive model can be expressed as a polynomial
over a subset of the model coefficients. In this way, Wright’s equations characterize the
relationship between the model coefficients and the covariance matrix and, subsequently,
provide an algebraic representation of the identification problem. A coefficient is
identified if and only if it can be solved in terms of the elements of the covariance matrix
using Wright’s equations^7.

Figure 5. Model illustrating the rules of d-separation

Wright’s method consists of equating the standardized covariance
$\sigma_{YX} = \rho_{YX}$ between any pair of variables to the sum of products of path
coefficients and error covariances along certain paths between X and Y. Let
$\Pi = \{\pi_1, \pi_2, \ldots, \pi_k\}$ denote the paths between X and Y that do not trace
colliding arrowheads, i.e. $\rightarrow\leftarrow$, $\leftrightarrow\leftarrow$,
$\rightarrow\leftrightarrow$, or $\leftrightarrow\leftrightarrow$, and let $p_i$ be the
product of path coefficients along path $\pi_i$. (We call a node where colliding arrowheads
meet a collider, e.g. Z in Figure 6a and C in Figure 5.) Then the covariance between
variables X and Y is $\sum_i p_i$. For example, if we wish to express the covariance of C
and E in Figure 5, we sum the products of the coefficients along the paths
$C \leftarrow B \leftarrow A \rightarrow E$, $C \leftarrow A \rightarrow E$, and
$C \rightarrow D \rightarrow E$, giving
$\sigma_{CE} = b \cdot a \cdot g + c \cdot g + d \cdot h$. However, we do not include the
coefficients along the path from C to E that passes through D and B because it traces a
collider. For models represented by the diagram in Figure 4a, we have
$\sigma_{YX} = a + bc$, while $\sigma_{YX} = a + bc + C_{YX}$ for the diagram in Figure 4b.

To express partial covariances, correlations, or regression coefficients in terms of path
coefficients we first apply Equations 9-14 and then use Wright’s tracing rules for each
covariance term. For example, reducing $\beta_{YX.Z}$ for the model represented by Figure 4a
can be done as follows:

$$
\begin{aligned}
\beta_{YX.Z} &= \frac{\sigma_Y}{\sigma_X} \, \frac{\rho_{YX} - \rho_{YZ}\,\rho_{ZX}}{1 - \rho_{XZ}^2} \\
&= \frac{1}{1} \cdot \frac{(a + bc) - (c + ab)(b)}{1 - b^2} \\
&= \frac{a + bc - bc - ab^2}{1 - b^2} \\
&= \frac{a(1 - b^2)}{1 - b^2} \\
&= a
\end{aligned}
$$
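
The same reduction can be verified numerically from the model-implied covariance matrix. The sketch below assumes a particular reading of Figure 4a that is consistent with the correlations used in the derivation above ($\rho_{ZX} = b$, $\rho_{YZ} = c + ab$, $\rho_{YX} = a + bc$): arrows from Z to X (coefficient b), Z to Y (coefficient c), and X to Y (coefficient a), with independent errors. The coefficient values are arbitrary, and error variances are chosen so that every variable has unit variance. For a recursive linear SEM written as $y = \Lambda^T y + u$, the implied covariance is $(I - \Lambda^T)^{-1} \Omega (I - \Lambda^T)^{-T}$, where $\Omega$ is the error covariance matrix.

```python
import numpy as np

a, b, c = 0.3, 0.5, 0.4                      # hypothetical path coefficients
iZ, iX, iY = 0, 1, 2                         # variable order: Z, X, Y

# Lam[j, i] holds the coefficient on the arrow from variable j to variable i.
Lam = np.zeros((3, 3))
Lam[iZ, iX] = b
Lam[iZ, iY] = c
Lam[iX, iY] = a

# Error variances chosen so that each variable is standardized (variance 1).
Omega = np.diag([1.0, 1 - b**2, 1 - (a**2 + c**2 + 2 * a * b * c)])

T = np.linalg.inv(np.eye(3) - Lam.T)         # y = T u
Sigma = T @ Omega @ T.T                      # model-implied covariance matrix

# Partial regression of Y on X given Z, computed from Sigma alone.
regressors = [iX, iZ]
beta = np.linalg.solve(Sigma[np.ix_(regressors, regressors)], Sigma[regressors, iY])
beta_yx_z = beta[0]
```

Here `Sigma[iY, iX]` reproduces Wright's sum $a + bc$ and `beta_yx_z` reproduces the path coefficient a, matching the algebra above.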

D-Separation 

As mentioned previously, when the conditioning set becomes large, applying the recursive
formula of Equations 12-14 can become complex. Vanishing partial correlations, however, can
be readily identified from the path diagram using a criterion called d-separation (Pearl,
1988)^8.

^6 In footnote 12 we give an example demonstrating that “holding all other factors fixed”
cannot be simulated using conditioning but instead must invoke intervention.

^7 However, as Wright’s equations are non-linear, it can be very difficult to analyze the
identification of large models by studying solutions for the system of equations.

Figure 6. Examples illustrating conditioning on a collider

The idea of d-separation is to associate “correlation” with “connectedness” in the graph,
and independence with “separation”. The only twist on this simple idea is to define what we
mean by “connected path”, given that we are dealing with a system of directed arrows in
which some vertices (those residing in the conditioning set, Z) correspond to variables
whose values are measured precisely. To account for the orientations of the arrows we use
the terms “d-separated” and “d-connected” (d denotes “directional”).

Rule  1:  X  and  Y  are  d-separated  if  there  is  no  active  path 
between  them. 

By “active path”, we mean a path that can be traced without traversing a collider. If no
active path exists between X and Y then we say that X and Y are d-separated. As we can see
from Wright’s rules, $\rho_{XY}$ vanishes when X and Y are d-separated.

When we measure a set Z of variables, and take their values as given, the partial
covariances of the remaining variables change character; some correlated variables become
uncorrelated, and some uncorrelated variables become correlated. To represent this dynamic
in the graph, we need the notion of “partial d-connectedness” or, more concretely,
“d-connectedness conditioned on a set Z of measurements”.

Rule 2: X and Y are d-connected, conditioned on a set of Z nodes, if there is a
collider-free path between X and Y that traverses no member of Z. If no such path exists,
we say that X and Y are d-separated by Z or we say that every path between X and Y is
“blocked” by Z.

Figure 7. Diagram illustrating why Ice Cream Sales and Drownings are uncorrelated given
Temperature and/or Water Activities

A  common  example  used  to  show  that  correlation  does  not 
imply  causation  is  the  fact  that  ice  cream  sales  are  correlated 
with  drowning  deaths.  When  the  weather  gets  warm  people 
tend  to  both  buy  ice  cream  and  play  in  the  water,  resulting 
in  both  increased  ice  cream  sales  and  drowning  deaths. 
This  causal  structure  is  depicted  in  Figure  7.  Here,  we  see 
that  Ice  Cream  Sales  and  Drownings  are  d-separated  given 
either  Temperature  or  Water  Activities.  As  a  result,  if  we 
only  consider  days  with  the  same  temperature  and/or  the 
same  number  of  people  engaging  in  water  activities  then  the 
correlation  between  Ice  Cream  Sales  and  Drownings  will 
vanish. 

Rule  3:  If  a  collider  is  a  member  of  the  conditioning  set  Z, 
or  has  a  descendant  in  Z,  then  the  collider  no  longer  blocks 
any  path  that  traces  it. 

According to Rule 3, conditioning can unblock a blocked path from X to Y. This is due to the
fact that conditioning on a collider or its descendant opens the flow of information between
the parents of the collider. For example, X and Y are uncorrelated in Figure 6a. However,
conditioning on the collider, Z, correlates X and Y, giving $\rho_{XY.Z} \neq 0$. This
phenomenon is known as Berkson’s paradox or “explaining away”. To illustrate, consider the
example depicted in Figure 6b. It is well known that higher education often affords one a
greater salary. Additionally, studies have shown that height also has a positive impact on
one’s salary. Let us assume that there are no other determinants of salary and that Height
and Education are uncorrelated. If we observe an individual with a high salary who is also
short, our belief that the individual is highly educated increases. As a result, we see that
observing Salary correlates Education and Height. Similarly, observing an effect or
indicator of salary, say the individual’s Ferrari, also correlates Education and Height.

^8 See also (Hayduk et al., 2003) and (Mulaik, 2009) for an introduction to d-separation
tailored to SEM practitioners.

The fact that $\sigma_{YX.Z} \neq 0$ when $\sigma_{YX} = 0$ and Z is a common
child of X and Y can also be illustrated using Wright’s path
tracing rules. Consider Figure 6a where Z is a common effect
of X and Y. We have $\sigma_{YX} = 0$ and, using Equation 10,

$$\sigma_{YX.Z} = \sigma_{YX} - \frac{\sigma_{YZ}\,\sigma_{ZX}}{\sigma_Z^2} = -ab.$$

When a and b are non-zero we have an algebraic confirmation
of our intuition from the salary example that X and Y
are uncorrelated marginally, but become correlated when
we condition on Z.
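
The algebra above can be checked numerically. The following is a minimal Python sketch, assuming standardized variables (so that the variance of Z is 1) and hypothetical values a = 0.5 and b = 0.4 for the two arrows entering the collider Z in Figure 6a. The function name `partial_cov` is ours and does not come from the paper.

```python
def partial_cov(s_yx, s_yz, s_zx, s_zz=1.0):
    """Partial covariance of Y and X given Z: s_yx - s_yz * s_zx / s_zz."""
    return s_yx - s_yz * s_zx / s_zz


# Hypothetical path coefficients for the collider X -> Z <- Y (Figure 6a).
a, b = 0.5, 0.4

# Wright's rules for standardized variables: X and Y share no unblocked
# path, while each of them is linked to Z by a single edge.
s_yx, s_zx, s_yz = 0.0, a, b

collider_result = partial_cov(s_yx, s_yz, s_zx)
print(collider_result)  # equals -a * b up to floating-point rounding
```

Setting either coefficient to zero makes the partial covariance vanish again, which matches the requirement that both a and b be non-zero.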

Berkson’s paradox implies that paths containing colliders
can be unblocked by conditioning on colliders or their
descendants. Let $\pi'$ be a path from X to Y that traces a collider.
If for each collider on the path $\pi'$, either the collider or a
descendant of the collider is in the conditioning set Z then $\pi'$
is unblocked given Z. The exception to this rule is if Z also
contains a non-collider along the path $\pi'$, in which case X and
Y are still blocked given Z. For example, in Figure 5 the path
F → C ← A → E is unblocked given C or D. However, it is
blocked given {A, C} or {A, D}.

The above three rules characterize d-separation while the
following theorem makes explicit the relationship between
partial correlation and d-separation.

Theorem 1. Let G be the path diagram for a SEM over a set
of variables V. If $X \in V$ and $Y \in V$ are d-separated given
a set $Z \subset V$ in the path diagram, G, then
$\sigma_{XY.Z} = \rho_{XY.Z} = \beta_{XY.Z} = \beta_{YX.Z} = 0$.

If X and Y are d-connected given Z then $\sigma_{XY.Z}$ is generally
not equal to zero but may equal zero for particular
parameterizations. For example, it is possible that the values of the
coefficients are such that the unblocked paths between X and
Y perfectly cancel one another.
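
The three rules can be mechanized. The sketch below is one possible Python implementation for diagrams that contain only directed edges (a bidirected arc would have to be represented by an explicit latent parent). It does not trace paths; instead it uses the equivalent ancestral moral graph test: keep only the ancestors of the variables involved, connect ("marry") parents that share a child, drop arrow directions, delete the conditioning set, and check whether X can still reach Y. The function names and the encoding of the graphs are our own assumptions; the example graphs are the collider of Figure 6a, the salary example of Figure 6b, and a three-variable chain.

```python
from collections import deque


def ancestors_of(parents, nodes):
    """Return `nodes` together with all of their ancestors."""
    seen = set(nodes)
    stack = list(nodes)
    while stack:
        v = stack.pop()
        for p in parents.get(v, ()):
            if p not in seen:
                seen.add(p)
                stack.append(p)
    return seen


def d_separated(edges, x, y, given=()):
    """True if x and y are d-separated by `given` in the DAG `edges`."""
    given = set(given)
    parents = {}
    for tail, head in edges:
        parents.setdefault(head, set()).add(tail)
    keep = ancestors_of(parents, {x, y} | given)
    # Moralize the ancestral graph: link each child to its parents and
    # link every pair of parents that share a child.
    adj = {v: set() for v in keep}
    for child in keep:
        pas = sorted(parents.get(child, ()))
        for i, p in enumerate(pas):
            adj[p].add(child)
            adj[child].add(p)
            for q in pas[i + 1:]:
                adj[p].add(q)
                adj[q].add(p)
    # Search for y from x without stepping onto conditioned variables.
    seen, queue = {x}, deque([x])
    while queue:
        v = queue.popleft()
        if v == y:
            return False
        for w in adj[v]:
            if w not in seen and w not in given:
                seen.add(w)
                queue.append(w)
    return True


collider = [("X", "Z"), ("Y", "Z")]
salary = [("Education", "Salary"), ("Height", "Salary"), ("Salary", "Ferrari")]
chain = [("X", "M"), ("M", "Y")]

print(d_separated(collider, "X", "Y"))                  # Rule 2 (collider blocks)
print(d_separated(collider, "X", "Y", {"Z"}))           # Rule 3 (collider opened)
print(d_separated(salary, "Education", "Height", {"Ferrari"}))
print(d_separated(chain, "X", "Y", {"M"}))              # Rule 1 (chain blocked)
```

By Theorem 1, every call that returns True corresponds to a vanishing partial correlation in any linear SEM with that diagram.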

We use the diagram depicted in Figure 5 as an example to
illustrate the rules of d-separation. In this example, F is
d-separated from E by A and C. However, C is not d-separated
from E by A and D since conditioning on D opens the
collider C → D ← B. Finally, C is d-separated from E by
conditioning on A, D, and B.

To  illustrate  the  power  and  applicability  of  d-separation, 
we  pose  and  answer  two  questions  regarding  regression  on 
the  model  depicted  in  Figure  5. 


X → M → Y

Figure 8. The fact that M d-separates X from Y implies
$\beta_{YX.M} = 0$


(i) Suppose we regress B on all other variables,

$$B = \beta_F F + \beta_D D + \beta_A A + \beta_C C + \beta_E E + \epsilon_B,$$

which regression coefficients will be 0?

We might naively expect that all regression coefficients
associated with variables that are not connected to B,
$\beta_A$, $\beta_C$, and $\beta_F$, will vanish. However, D and E are
colliders so regressing on them opens the path to A and
C. Therefore, the only vanishing regression coefficient
is $\beta_F$.

(ii) Suppose we regress E on A and B. Which variables can
be added to the regression without changing the
coefficient of B?

Since  F  and  C  are  d-separated  from  B  given  A  they  can 
be  added  to  the  regression  without  changing  the  coeffi¬ 
cient  of  B.  However,  adding  D  is  liable  to  change  the 
regression  coefficient  of  B.  Equivalent  regressor  sets 
will  be  discussed  in  more  detail  in  a  later  section. 

Some  SEM  researchers  regard  the  resilience  and  stability 
of  regression  coefficients  to  additional  regressors  to  be  a  sign 
of  robustness^.  As  can  be  seen  from  the  above  example,  sen¬ 
sitivity  to  adding  regressors  has  little  to  do  with  misspecifi- 
cation;  whether  or  not  a  regression  coefficient  changes  when 
regressors  are  added  is  dependent  on  the  structure  of  the  data- 
generating  model. 

To emphasize this point, we demonstrate an extreme case
of sensitivity in a well-specified model. Those familiar with
the concept of exogeneity will recognize that, in Figure 8, X
is uncorrelated with the error term of Y. As a result, simply
regressing Y on X will give an unbiased estimate of the total
effect of X on Y. However, if we add the mediator, M, to the
regression then the coefficient for X, $\beta_{YX.M}$, vanishes.

^According to Lu and White (2014), “A common exercise in
empirical studies is a ‘robustness check,’ where the researcher
examines how certain ‘core’ regression coefficient estimates behave
when the regression specification is modified by adding or removing
regressors.” “Of the 98 papers published in The American Economic
Review during 2009, 76 involve some data analysis. Of
these, 23 perform a robustness check along the lines just described,
using a variety of estimators.” In a more recent survey of
non-experimental empirical work, Oster (2013) finds that 75% of 2012
papers published in The American Economic Review, Journal of
Political Economy, and Quarterly Journal of Economics explored the
sensitivity of results to varying control sets. Since this practice is
conducted to help diagnose misspecification, the answer to Question
5 is essential for discerning whether an altered coefficient indicates
misspecification or not.
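
The mediator example can be verified with a few lines of arithmetic. The sketch below assumes the chain X → M → Y of Figure 8 with standardized variables and hypothetical coefficients p (for X → M) and q (for M → Y); neither symbol appears in the paper. It uses the standard covariance formula for the coefficient of X in a regression of Y on X and one further regressor.

```python
def beta_yx_given_z(s_yx, s_yz, s_zx, s_xx=1.0, s_zz=1.0):
    """Coefficient of X when Y is regressed on X and Z (covariance form)."""
    return (s_yx * s_zz - s_yz * s_zx) / (s_xx * s_zz - s_zx ** 2)


# Hypothetical coefficients of X -> M and M -> Y.
p, q = 0.6, 0.5

# Wright's rules for the chain, standardized variables.
s_xm, s_my, s_xy = p, q, p * q

total_effect = s_xy  # beta_YX equals sigma_XY when X has unit variance
with_mediator = beta_yx_given_z(s_xy, s_my, s_xm)

print(total_effect)   # the total effect p * q
print(with_mediator)  # vanishes once M is added to the regression
```

The coefficient of X changes from the total effect to zero although the model is correctly specified, which is the point of the example.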

D-separation formalizes the intuition that paths carry
associational information between variables and that this flow
of information can be blocked by conditioning. This intuition
drives many of the results in identification, model testing,
and other problems that will be discussed in subsequent
sections, making d-separation an essential component of
graphical modeling.

We  conclude  this  section  by  noting  that  d-separation  im¬ 
plies  vanishing  partial  correlation  in  both  recursive  and  non¬ 
recursive  linear  models  (Spirtes,  1995).  Further,  all  vanish¬ 
ing  partial  correlations  implied  by  a  SEM  can  be  obtained  us¬ 
ing  d-separation  (Pearl,  2000).  Finally,  in  models  with  inde¬ 
pendent  error  terms,  these  vanishing  partial  correlations  rep¬ 
resent  all  of  the  model’s  testable  implications  (Pearl,  2000). 

Identification 

A model parameter is identified if it is uniquely determined
from the covariance matrix. If every parameter in the
model is identified then the model is said to be identified. If
there is at least one unidentified parameter then the model
is not identified or unidentified^^. For example, consider the
model represented by Figure 4a. Using Wright’s equations
we obtain the following equalities:


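$$\sigma_{XY} = a + bc \qquad (18)$$

$$\sigma_{ZX} = b \qquad (19)$$

$$\sigma_{ZY} = c + ba \qquad (20)$$

As a result, b is uniquely identified with $b = \sigma_{ZX}$. Now,
substituting $\sigma_{ZX}$ for b into the other two equations we obtain:

$$\sigma_{XY} = a + \sigma_{ZX}\,c \qquad (21)$$

$$\sigma_{ZY} = c + \sigma_{ZX}\,a \qquad (22)$$

Both a and c are identified, with
$a = (\sigma_{XY} - \sigma_{ZX}\sigma_{ZY}) / (1 - \sigma_{ZX}^2)$ and
$c = (\sigma_{ZY} - \sigma_{ZX}\sigma_{XY}) / (1 - \sigma_{ZX}^2)$. We see that all model
parameters have unique solutions in terms of the covariance matrix,
and hence the model is identified.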
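
As a sanity check on this derivation, the following Python sketch generates the three covariances of Figure 4a from chosen coefficients using Equations 18 to 20 and then recovers the coefficients by solving Equations 21 and 22. The numerical values and the function names are illustrative assumptions; standardized variables are assumed throughout, as in Wright's rules.

```python
def wright_covariances(a, b, c):
    """Covariances implied by Figure 4a (Equations 18-20)."""
    return {"XY": a + b * c, "ZX": b, "ZY": c + b * a}


def solve_parameters(cov):
    """Recover (a, b, c) from the covariances by solving Equations 21-22."""
    b = cov["ZX"]
    denom = 1.0 - b ** 2  # non-zero as long as Z and X are not collinear
    a = (cov["XY"] - b * cov["ZY"]) / denom
    c = (cov["ZY"] - b * cov["XY"]) / denom
    return a, b, c


true_values = (0.3, 0.5, 0.4)  # hypothetical a, b, c
recovered = solve_parameters(wright_covariances(*true_values))
print(recovered)  # matches true_values up to rounding
```

Because the map from covariances back to (a, b, c) is single-valued, each parameter has exactly one solution, which is what identification requires.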

In contrast, the model depicted in Figure 9 is not
identified. Using Wright’s rules we obtain a single equation:
$a + C_{XY} = \sigma_{XY}$. Since there are infinitely many values for a and
$C_{XY}$ that satisfy this equation, neither parameter is identified
and the model is not identified.

Figure 9. A diagram representing an unidentified model

Many SEM researchers identify structural models by submitting
the specification and data to an SEM program, which
attempts to minimize a fitting function. If there is difficulty
in this computation, the program warns that the model may
not be identified. While convenient, there are disadvantages
to using typical SEM software to determine model
identifiability. Kenny and Milan (2011) list the following drawbacks:

(i)  If  poor  starting  values  are  chosen,  the  program  could 
mistakenly  conclude  the  model  is  not  identified  when 
in  fact  it  may  be  identified. 

(ii)  The  program  is  not  very  helpful  in  indicating  which  pa¬ 
rameters  are  not  identified. 

(iii)  Most  importantly,  the  program  only  gives  an  answer 
after  the  researcher  has  taken  the  time  to  collect  data. 

We  add  two  additional  drawbacks  to  this  list: 

(iv)  If  poor  starting  values  are  chosen,  the  program  may  exit 
with  parameter  values  at  a  local  minimum  of  the  fitting 
function  rather  than  the  global  minimum,  giving  incor¬ 
rect  values  to  the  parameters. 

(v) If even one coefficient is not identifiable, most software
packages^ are unable to identify any of the path
coefficients.

In this section, we give graphical criteria that allow the
modeler to determine the identifiability of individual parameters
from mere inspection of the path diagram. Further, our
criteria also give the values of the identified parameters in
terms of the entries of the covariance matrix. While these
methods are not complete in the sense that they may not be
able to identify every coefficient that is identifiable, they
subsume the identifiability rules in the existing SEM literature,
including the well known recursive and null rules (Bollen,
1989) and the regression rule (Kenny and Milan, 2011).

A  Simple  Criterion  for  Identifying  Individual  Coefficients 

In Figure 4a, Z is a common cause of both X and Y and is
often called a confounder. In epidemiology and other areas,
it is well known that, in order to estimate a using regression,
we must “adjust for” Z by including it in the regression
equation. However, common causes are not always available
for measurement or adjustment. Instead, proxies are often
used. For instance, in Figure 10a, W might well serve to
replace Z in the adjustment and, thus, deconfound the
relationship between X and Y. The question arises, how can we, in
general, determine whether a set of variables is adequate for
adjustment when attempting to identify a given coefficient?
In other words, when would the regression coefficient of X in
the regression of Y on X and Z be equal to the path coefficient
from X to Y? The following criterion, called single-door,
allows the modeler to answer this question by inspection of the
path diagram.

Figure 10. Diagrams illustrating identification by the single-door
criterion (a) $\alpha$ is identified by adjusting for Z or W (b)
The graph $G_\alpha$ used in the identification of $\alpha$ (c) $\alpha$ is identified
by adjusting for Z (or Z and W) but not W alone

^^Many authors also use the term “under-identified”. This term
can be confusing because it suggests models that are not identifiable
have no testable implications. This is not the case.

According to Kenny and Milan (2011), AMOS is the only program
that attempts to identify parameters when the model is
under-identified.

Theorem 2. (Pearl, 2000) (Single-door Criterion) Let G be
any recursive causal graph in which $\alpha$ is the path coefficient
associated with link X → Y, and let $G_\alpha$ denote the diagram
that results when X → Y is deleted from G. The coefficient $\alpha$
is identifiable if there exists a set of variables Z such that (i)
Z contains no descendant of Y and (ii) Z d-separates X from
Y in $G_\alpha$. If Z satisfies these two conditions, then $\alpha$ is equal to
the regression coefficient $\beta_{YX.Z}$. Conversely, if Z does not
satisfy these conditions, then $\beta_{YX.Z}$ is not a consistent estimand
of $\alpha$ (except in rare instances of measure zero).

In Figure 10a, we see that Z blocks the spurious path
X ← Z → W → Y and X is d-separated from Y by Z in
Figure 10b. Therefore, $\alpha = \beta_{YX.Z}$. This is to be expected
since Z is a common cause of X and Y. Theorem 2 tells
us, however, that W can also be used for adjustment since
W also d-separates X from Y in Figure 10b, and we obtain
$\alpha = \beta_{YX.W}$. Moreover, we will see in a subsequent section
that the choice of W is superior to that of Z in terms of
estimation power. Consider, however, Figure 10c. Z satisfies the
single-door criterion but W does not. Being a collider, W
unblocks the spurious path, X ← Z → W ↔ Y, in violation of
Theorem 2, leading to bias if adjusted for^^. In conclusion, $\alpha$
is equal to $\beta_{YX.Z}$ in Figures 10a and 10c. However, $\alpha$ is equal
to $\beta_{YX.W}$ in Figure 10a only.

Figure 11. Example showing that adjusting for a descendant
of Y induces bias in the estimation of $\alpha$
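
The single-door claims for Figure 10a can be checked exactly, without simulation. The sketch below assumes the structure described in the text (Z → X, Z → W, W → Y, X → Y), independent unit-variance errors, and hypothetical coefficient values; `alpha` stands for the coefficient of X → Y. The helper `implied_cov` (our name) propagates covariances through a recursive model in causal order.

```python
def implied_cov(order, coef, err_var):
    """Covariances implied by a recursive linear SEM with independent errors.

    `order` lists the variables in causal order; `coef[v]` maps each parent
    of v to its path coefficient.
    """
    cov = {}
    for j, v in enumerate(order):
        pa = coef.get(v, {})
        for u in order[:j]:
            c = sum(w * cov[(u, p)] for p, w in pa.items())
            cov[(u, v)] = cov[(v, u)] = c
        cov[(v, v)] = err_var[v] + sum(w * cov[(v, p)] for p, w in pa.items())
    return cov


def beta(cov, y, x, z):
    """Coefficient of x when y is regressed on x and z."""
    num = cov[(y, x)] * cov[(z, z)] - cov[(y, z)] * cov[(z, x)]
    den = cov[(x, x)] * cov[(z, z)] - cov[(x, z)] ** 2
    return num / den


alpha = 0.4  # hypothetical coefficient of X -> Y
order = ["Z", "X", "W", "Y"]
coef = {"X": {"Z": 0.6}, "W": {"Z": 0.5}, "Y": {"X": alpha, "W": 0.7}}
cov = implied_cov(order, coef, {v: 1.0 for v in order})

unadjusted = cov[("Y", "X")] / cov[("X", "X")]
adjust_z = beta(cov, "Y", "X", "Z")
adjust_w = beta(cov, "Y", "X", "W")
print(unadjusted, adjust_z, adjust_w)
```

Under these assumptions both adjusted coefficients equal `alpha`, while the unadjusted regression coefficient does not.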


It is well known that estimating $\alpha$ using regression
requires that X be uncorrelated with the error term of Y. Notice
that the single-door criterion gives the graphical conditions
for when conditioning on a set Z renders X and the error term
of Y uncorrelated. Whenever X is d-separated from Y when
the edge X → Y is removed then X must also be d-separated
from the error term of Y in the original graph since Y acts as
a collider. For example, in Figure 11b, X is d-separated from
$U_Y$ and X is d-separated from Y when X → Y is removed.


The intuition for the requirement that Z not be a descendant
of Y is depicted in Figures 11a and 11b. We typically do
not display the error terms, which can be understood as latent
causes. In Figure 11b, we show the error terms explicitly. It
should now be clear that Y is a collider and conditioning on Z
will create spurious correlation between X, $U_Y$, and Y, leading
to bias if adjusted for.
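
The size of this bias can be worked out for a small case. The sketch below assumes the simplest version of Figure 11, a chain X → Y → Z in which Z is a direct child of Y, with unit-variance errors and hypothetical coefficients `alpha` (X → Y) and `c` (Y → Z). Under these assumptions the adjusted coefficient works out to alpha / (1 + c^2) rather than alpha.

```python
def beta_yx_given_z(s_yx, s_yz, s_zx, s_xx, s_zz):
    """Coefficient of X when Y is regressed on X and Z (covariance form)."""
    return (s_yx * s_zz - s_yz * s_zx) / (s_xx * s_zz - s_zx ** 2)


alpha, c = 0.5, 1.0  # hypothetical coefficients of X -> Y and Y -> Z

# Covariances implied by X = U_X, Y = alpha*X + U_Y, Z = c*Y + U_Z
# with independent unit-variance error terms.
s_xx = 1.0
s_yx = alpha * s_xx
s_yy = alpha ** 2 * s_xx + 1.0
s_zx = c * s_yx
s_zy = c * s_yy
s_zz = c ** 2 * s_yy + 1.0

unadjusted = s_yx / s_xx
adjusted = beta_yx_given_z(s_yx, s_zy, s_zx, s_xx, s_zz)
print(unadjusted, adjusted)
```

Here the unadjusted regression recovers `alpha`, while adjusting for the descendant Z shrinks the estimate.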


No matter how complex the model, the single-door
criterion gives us a quick and reliable test, sufficient for
identification of a structural parameter using regression. It
allows us to choose a variety of conditioning sets using
considerations of estimation power, sample variability, cost of
measurement and more. Further, it is an important tool that
plays a role in the identification of parameters in more
elaborate models.

[Figure 12: panels (a) and (b); in (b), Z is a conditional
instrument (given W)]

Figure 13. Diagrams illustrating instrumental sets

Instrumental  Variables 

It is well known that an instrumental variable, Z, can be
used to identify a path coefficient, $\alpha$, even when there is
unobserved confounding between X and Y, as in Figure 12a.
The following is a standard definition of an instrumental
variable, adapted from Wikipedia (2014):

Definition 1. A variable Z qualifies as an instrumental
variable for a coefficient from X to Y if

(i) Z is correlated with X

(ii) Z is uncorrelated with the error term of Y

Explanation typically stops here and the modeler has to
determine judgmentally whether condition (ii) is satisfied in
a complex model containing multiple equations and error
terms. The major obstacles in such judgment are:

(i) To understand what is meant by “the error term” of
Y within a system of many equations.

(ii) To judge whether Z is uncorrelated with that “error
term” within a system of correlated errors.


As a result, instrumental variables are often incorrectly
determined. In this section, we show how to utilize the path
diagram to determine by inspection whether a variable is an
instrument. Bollen and Bauer (2004) called such variables
“model-implied” instruments and used algebraic methods for
their identification. Kyono (2010), on the other hand, used
graphical methods similar to those described here^^. Finally,
we also show how to find conditional instrumental variables
and instrumental sets (Brito and Pearl, 2002a), allowing us
to identify $\alpha$ in Figure 12b and both $\gamma$ and $\alpha$ in Figure 13a.

Theorem 3. A variable Z qualifies as an instrumental
variable for coefficient $\alpha$ from X to Y if

(i) Z is d-separated from Y in the subgraph $G_\alpha$ obtained
by removing edge X → Y from G and

(ii) Z is not d-separated from X in $G_\alpha$.

When Z is an instrument for $\alpha$ then $\alpha = \beta_{YZ} / \beta_{XZ}$.

Z in Figure 12a is an example of an instrumental variable
since Z is d-separated from Y but still d-connected to X when
we remove the edge associated with $\alpha$. Using Wright’s path
tracing rules, it is easy to see that
$\beta_{YZ} / \beta_{XZ} = \sigma_{ZY} / \sigma_{ZX} = \alpha$.

Theorem  3  clarifies  the  circumstances  for  which  Z  quali¬ 
fies  as  an  instrument.  Even  if  the  underlying  structure  is  not 
clear  to  the  modeler,  he  or  she  is  able  to  consider  competing 
explanations  for  the  data  generating  process  and  determine 
immediately  which  of  them  qualifies  Z  as  an  instrument. 
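
A small numerical illustration of the instrumental-variable ratio follows. It is a sketch under stated assumptions: the structure Z → X → Y with correlated error terms on X and Y (our reading of Figure 12a), unit error variances, and hypothetical values for the coefficient `g` of Z → X, the coefficient `alpha` of X → Y, and the error covariance `r`.

```python
# Hypothetical parameters: Z -> X (g), X -> Y (alpha), cov(U_X, U_Y) = r.
g, alpha, r = 0.8, 0.5, 0.6

# Covariances implied by Z = U_Z, X = g*Z + U_X, Y = alpha*X + U_Y
# with unit error variances and cov(U_X, U_Y) = r.
s_zz = 1.0
s_zx = g * s_zz
s_xx = g ** 2 * s_zz + 1.0
s_zy = alpha * s_zx
s_xy = alpha * s_xx + r

ols_estimand = s_xy / s_xx  # regression of Y on X, confounded by r
iv_estimand = s_zy / s_zx   # ratio of covariances with the instrument
print(ols_estimand, iv_estimand)
```

With these numbers the regression coefficient of Y on X is alpha + r / s_xx, whereas the instrumental ratio returns `alpha`.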

In Figure 12b, Z is not an instrument because it is
d-connected to Y even after deleting the edge from X to Y.
However, we can condition on W to block the spurious path
from Z to Y through W and obtain $\alpha = \beta_{YZ.W} / \beta_{XZ.W}$. Thus, we see
that in some cases, variables may become instrumental
variables by conditioning on other variables.

Theorem 4. (Brito and Pearl, 2002a) A variable Z is a
conditional instrumental variable given a set W for coefficient $\alpha$
from X to Y if

(i) W contains only non-descendants of Y

(ii) W d-separates Z from Y in the subgraph $G_\alpha$ obtained
by removing edge X → Y from G

(iii) W does not d-separate Z from X in $G_\alpha$

When Z is a conditional instrument for $\alpha$ given W then

$$\alpha = \frac{\beta_{YZ.W}}{\beta_{XZ.W}}.$$
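
The conditional-instrument ratio can be illustrated on a concrete model. The diagram of Figure 12b is not reproduced here, so the sketch below assumes one structure that fits the description: W → Z, W → Y, Z → X, X → Y, with correlated errors on X and Y. All coefficient values and function names are hypothetical. Covariances are computed by expressing each variable as a weighted sum of error terms.

```python
def error_loadings(order, coef):
    """Express every variable as a weighted sum of the error terms."""
    load = {}
    for v in order:
        row = {v: 1.0}
        for parent, weight in coef.get(v, {}).items():
            for err, l in load[parent].items():
                row[err] = row.get(err, 0.0) + weight * l
        load[v] = row
    return load


def covariance(load, omega, u, v):
    """Covariance of u and v given the error covariances `omega`."""
    return sum(lu * lv * omega.get(frozenset((i, j)), 0.0)
               for i, lu in load[u].items()
               for j, lv in load[v].items())


def beta_given(cov, y, x, w):
    """Coefficient of x when y is regressed on x and w."""
    num = cov(y, x) * cov(w, w) - cov(y, w) * cov(w, x)
    den = cov(x, x) * cov(w, w) - cov(x, w) ** 2
    return num / den


alpha = 0.5  # hypothetical coefficient of X -> Y
order = ["W", "Z", "X", "Y"]
coef = {"Z": {"W": 0.7}, "X": {"Z": 0.8}, "Y": {"X": alpha, "W": 0.9}}
omega = {frozenset((v,)): 1.0 for v in order}  # unit error variances
omega[frozenset(("X", "Y"))] = 0.6             # confounding of X and Y

load = error_loadings(order, coef)


def cov(u, v):
    return covariance(load, omega, u, v)


plain_ratio = cov("Y", "Z") / cov("X", "Z")
conditional_ratio = beta_given(cov, "Y", "Z", "W") / beta_given(cov, "X", "Z", "W")
print(plain_ratio, conditional_ratio)
```

In this assumed model the unconditional ratio is contaminated by the path through W, while the ratio of coefficients conditional on W equals `alpha`.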

It  is  for  this  reason  that  the  direct  effect  cannot  be  defined 
by  conditioning  on  a  mediator  but  must  instead  invoke  intervention 
(Pearl,  2013,  2014b),  as  we  did  earlier. 

Kyono (2010) also released software that implements
d-separation, the single-door criterion, graphical techniques for
identifying instruments, and more. Much of this functionality is also
implemented in DAGitty (Textor et al., 2011), which includes a
convenient GUI for manipulating graphs.




BRYANT  CHEN  AND  JUDEA  PEARL 


Additionally, it may be possible to use multiple variables
as an instrumental set in order to identify parameters when,
individually, none of the variables qualify as an instrument.
In Figure 13a, neither $Z_1$ nor $Z_2$ is a conditional instrument
for the identification of $\gamma$ or $\alpha$. However, using them
simultaneously allows the identification of both coefficients.
Using Wright’s equations, as we did in the single instrumental
variable case, we have:

$$\sigma_{Z_1 Y} = \sigma_{Z_1 X_1}\gamma + \sigma_{Z_1 X_2}\alpha$$
$$\sigma_{Z_2 Y} = \sigma_{Z_2 X_1}\gamma + \sigma_{Z_2 X_2}\alpha$$

As a result, we are able to obtain two linearly independent
equations with two unknowns and solve for $\gamma$ and $\alpha$. We call
a set of variables that enables a solution in this manner an
instrumental set.

Definition 2. $\{Z_1, Z_2, ..., Z_n\}$ is an instrumental set for the
coefficients $\alpha_1, ..., \alpha_n$ associated with edges
$X_1 \to Y, ..., X_n \to Y$ if the following conditions are satisfied.

(i) Let $\bar{G}$ be the graph obtained from G by deleting edges
$X_1 \to Y, ..., X_n \to Y$. Then, $Z_i$ is d-separated from Y in
$\bar{G}$ for all $i \in \{1, 2, ..., n\}$.

(ii) There exist paths $p_1, p_2, ..., p_n$ such that $p_i$ is a path
from $Z_i$ to Y that includes edge $X_i \to Y$ and if paths $p_i$
and $p_j$ have a common variable V, then either

(a) both $p_i[Z_i ... V]$ and $p_j[V ... Y]$ point to V or

(b) both $p_j[Z_j ... V]$ and $p_i[V ... Y]$ point to V

for all $i, j \in \{1, 2, ..., n\}$ and $i \neq j$.

The  second  condition  in  Definition  2  can  be  understood 
as  requiring  that  two  paths  pi  and  pj  cannot  be  broken  at  a 
common  variable  V  and  have  their  pieces  swapped  and  rear¬ 
ranged  to  form  two  unblocked  paths.  One  of  the  rearranged 
paths  must  contain  a  collider.  This  condition  is  illustrated  in 
the  example  below. 

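Theorem 5. Let $\{Z_1, Z_2, ..., Z_n\}$ be an instrumental set for the
coefficients $\alpha_1, ..., \alpha_n$ associated with edges
$X_1 \to Y, ..., X_n \to Y$. Then the linear equations,

$$\sigma_{Z_1 Y} = \sigma_{Z_1 X_1}\alpha_1 + \sigma_{Z_1 X_2}\alpha_2 + \cdots + \sigma_{Z_1 X_n}\alpha_n$$
$$\sigma_{Z_2 Y} = \sigma_{Z_2 X_1}\alpha_1 + \sigma_{Z_2 X_2}\alpha_2 + \cdots + \sigma_{Z_2 X_n}\alpha_n$$
$$\vdots$$
$$\sigma_{Z_n Y} = \sigma_{Z_n X_1}\alpha_1 + \sigma_{Z_n X_2}\alpha_2 + \cdots + \sigma_{Z_n X_n}\alpha_n,$$

are linearly independent for almost all parameterizations of
the model.

Figure 14. (a) $Z_1$ and $Z_2$ qualify as an instrumental set (b) $Z_1$
and $Z_2$ do not qualify as an instrumental set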


Like $Z_1$ and $Z_2$ in Figure 13a, $Z_1$ and $Z_2$ in Figure 14a qualify
as an instrumental set. $Z_1$ and $Z_2$ are d-separated from Y in
the graph $\bar{G}$, where the edges $X_1 \to Y$ and $X_2 \to Y$ have been
removed. Additionally, we have $p_1 = Z_1 \to Z_2 \to X_1 \to Y$
and $p_2 = Z_2 \leftrightarrow X_2 \to Y$. Using Wright’s rules we obtain

$$\sigma_{Z_1 Y} = ab\gamma = \sigma_{Z_1 X_1}\gamma + 0 \cdot \alpha = \sigma_{Z_1 X_1}\gamma + \sigma_{Z_1 X_2}\alpha$$

and

$$\sigma_{Z_2 Y} = b\gamma + c\alpha = \sigma_{Z_2 X_1}\gamma + \sigma_{Z_2 X_2}\alpha,$$

in accordance with Theorem 5. Solving the equations identifies
$\alpha$ and $\gamma$ giving:

$$\gamma = \frac{\sigma_{Z_1 Y}}{\sigma_{Z_1 X_1}}$$

$$\alpha = \frac{\sigma_{Z_2 Y}}{\sigma_{Z_2 X_2}} - \frac{\sigma_{Z_2 X_1}\,\sigma_{Z_1 Y}}{\sigma_{Z_2 X_2}\,\sigma_{Z_1 X_1}}$$

Notice that $p_1$ and $p_2$ satisfy the second condition of
Definition 2 because in $p_1$, the arrow associated with coefficient
a, entering the shared node, $Z_2$, is pointing at $Z_2$ while in
$p_2$, the arrow associated with parameter c, leaving $Z_2$, is also
pointing at the shared node, $Z_2$. As a result, if the paths
$p_1$ and $p_2$ are broken at the common variable, $Z_2$, and their
pieces swapped and rearranged, $p_1$ will become a blocked
path due to the collider at $Z_2$. Algebraically, this means that
$\sigma_{Z_1 Y}$ lacks the influence of the path $Z_2 \leftrightarrow X_2 \to Y$ and,
therefore, does not contain the term $ac\alpha$. $\sigma_{Z_2 Y}$, on the other
hand, contains the term $c\alpha$ associated with the path. It is in
this way that condition (ii) of Definition 2 allows $p_i$ and $p_j$
to share a node while still ensuring linear independence.

In contrast, consider Figure 14b. Here, $Z_1$ and $Z_2$ are not
an instrumental set for $\alpha$ and $\gamma$. Every path from one of
the two variables to Y is a “sub-path” of a path from the other
to Y, which, using Wright’s rules, implies that the equation for
$\sigma_{Z_2 Y}$ is not linearly independent of the equation for
$\sigma_{Z_1 Y}$ with respect to Y’s coefficients:

$$\sigma_{Z_1 Y} = b\gamma + c\alpha$$

$$\sigma_{Z_2 Y} = ab\gamma + ac\alpha = a(b\gamma + c\alpha) = a\,\sigma_{Z_1 Y}$$

In some cases, condition (i) of Definition 2 can be
satisfied by conditioning on a set W. Brito and Pearl (2002a)
show how conditioning can be used to obtain a conditional
instrumental set. Due to the more complex nature of applying
Wright’s rules over partial correlations, we do not cover
conditional instrumental sets in this paper and instead refer
the reader to (Brito and Pearl, 2002a).
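
The contrast between Figures 14a and 14b can be reproduced numerically from the covariance expressions given in the text. The sketch below plugs hypothetical values for a, b, c, $\gamma$ and $\alpha$ into those expressions and solves the resulting two-equation system by Cramer's rule; the function `solve_2x2` is ours and returns None when the equations are linearly dependent.

```python
def solve_2x2(matrix, rhs, tol=1e-12):
    """Solve a 2x2 linear system by Cramer's rule; None if it is singular."""
    (p, q), (r, s) = matrix
    det = p * s - q * r
    if abs(det) < tol:
        return None
    first = (rhs[0] * s - q * rhs[1]) / det
    second = (p * rhs[1] - rhs[0] * r) / det
    return first, second


# Hypothetical parameter values.
a, b, c = 0.5, 0.6, 0.7
gamma, alpha = 0.3, 0.4

# Figure 14a: rows are (sigma_{Zi X1}, sigma_{Zi X2}) for i = 1, 2.
fig14a = solve_2x2([[a * b, 0.0], [b, c]],
                   [a * b * gamma, b * gamma + c * alpha])

# Figure 14b: one equation is a multiple of the other.
fig14b = solve_2x2([[b, c], [a * b, a * c]],
                   [b * gamma + c * alpha, a * (b * gamma + c * alpha)])

print(fig14a)  # recovers (gamma, alpha) up to rounding
print(fig14b)  # None: the two equations do not pin down gamma and alpha
```

The singular case shows why condition (ii) of Definition 2 is needed: d-separation of each instrument from Y alone does not guarantee independent equations.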

C-Component  Decomposition 

In  this  subsection,  we  show  that  the  question  of  coeffi¬ 
cient  identification  can  be  addressed  using  smaller  and  sim¬ 
pler  sub-graphs  of  the  original  causal  graph.  Further,  in  some 
cases,  the  coefficient  is  not  identified  using  any  methods  con¬ 
sidered  thus  far  on  the  original  graph  but  is  identified  using 
those  methods  on  the  sub-graph. 

A c-component in a causal graph is a maximal set of
nodes such that all nodes are connected to one another by
paths consisting of bidirected arcs. For example, the graph
in Figure 14b consists of three c-components, $\{X_1, X_2, Y\}$,
$\{Z_2\}$, and $\{Z_1\}$, while the graph depicted in Figure 16
consists of a single c-component. Tian (2005) showed that a
coefficient is identified if and only if it is identified in the
sub-graph consisting of its c-component and the parents of
the c-component.

More formally, a coefficient from X to Y is identified if
and only if it is identified in the sub-model constructed in the
following way:

(i) The sub-model variables consist of the c-component to
which Y belongs, $C_Y$, union the parents of all variables
in that c-component.

(ii) The structural equations for the variables in $C_Y$ are the
same as their structural equations in the original model.

(iii)  The  structural  equations  for  the  parents  simply  equate 
each  parent  with  its  error  term. 

(iv)  If  the  error  terms  of  any  two  variables  in  the  sub-model 
were  uncorrelated  in  the  original  model  then  they  are 
uncorrelated  in  the  sub-model. 

For example, the sub-model for the coefficient $\alpha$ from X
to Y in Figure 15a consists of the following equations:

$$Z = U_Z$$
$$X = aZ + U_X$$
$$W = bX + U_W$$
$$V = U_V$$
$$Y = \alpha X + dV + U_Y$$

Additionally, $\rho_{U_X U_Y}$ and $\rho_{U_W U_Y}$ are unrestricted in their
values. All other error terms are uncorrelated.
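
Steps (i) to (iv) start from a purely graphical computation: find the c-component of Y and collect its parents. A minimal Python sketch of that step follows. The edge lists are our assumed encoding of Figure 15a (the diagram itself is not reproduced here); in particular the edges X → W and W → V are assumptions, while the bidirected arcs X ↔ Y and W ↔ Y follow from the unrestricted error correlations listed above.

```python
def c_component(node, bidirected):
    """Nodes reachable from `node` along bidirected arcs, including itself."""
    comp, stack = {node}, [node]
    while stack:
        v = stack.pop()
        for u, w in bidirected:
            other = w if u == v else u if w == v else None
            if other is not None and other not in comp:
                comp.add(other)
                stack.append(other)
    return comp


def sub_model_variables(y, directed, bidirected):
    """Return (c-component of y, parents of that component outside it)."""
    comp = c_component(y, bidirected)
    parents = {p for p, child in directed if child in comp}
    return comp, parents - comp


# Assumed encoding of Figure 15a.
directed = [("Z", "X"), ("X", "W"), ("W", "V"), ("V", "Y"), ("X", "Y")]
bidirected = [("X", "Y"), ("W", "Y")]

component, outside_parents = sub_model_variables("Y", directed, bidirected)
print(sorted(component), sorted(outside_parents))
```

The parents that fall outside the component are the variables whose structural equations are replaced by their error terms in step (iii).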

It is not clear how to identify the coefficient $\alpha$ depicted
in Figure 15a using any of the methods considered thus far.
However, the sub-graph for the c-component, {W, X, Y},
depicted in Figure 15b, shows that $\alpha$ is identified using Z as an
instrument. Therefore, $\alpha$ is identified in the original model.

Figure 15. (a) Example illustrating c-component decomposition
(b) Sub-graph consisting of c-component, {W, X, Y}, and
its parents, Z and V.

It is important to note that the covariances in the sub-model
are not necessarily the same as the covariances in
the original model. As a result, the identified expressions
obtained from the sub-model may not apply to the original
model. For example, Figure 15b shows that $\alpha = \beta_{YZ} / \beta_{XZ}$.
However, this is clearly not the case in Figure 15a. The above
method simply tells us that $\alpha$ is identified. It does not give us
the identified expression for $\alpha$.

Tian  (2005)  shows  how  the  covariance  matrix  for  the  sub¬ 
model  can  be  obtained  from  the  original  covariance  matrix 
thus  enabling  us  to  obtain  the  identified  expression  for  the 
parameter  in  the  original  model.  However,  we  do  not  cover 
it  here. 

Advanced  Algorithms 

In  this  subsection,  we  survey  advanced  algorithms  that  uti¬ 
lize  the  path  diagram  to  identify  model  parameters.  The  de¬ 
tails  of  these  algorithms  are  beyond  the  scope  of  this  paper, 
and  we  instead  refer  the  reader  to  the  relevant  literature  for 
more  information. 

Instrumental variables and sets demonstrate that algebraic
properties of linear independence translate to graphical
properties in the path diagram that can be used to identify model
coefficients. The G-Criterion algorithm (Brito, 2004; Brito
and Pearl, 2006) expands this notion in order to give a
method for systematically identifying the coefficients of a
recursive SEM.

Figure 16. A bow-free graph; the absence of a ‘bow’ pattern
assures identification

This algorithm was generalized by Foygel et al. (2012) to
determine identifiability of a greater set of graphs.
Additionally, their criterion, called the half-trek criterion, applies
to both recursive and non-recursive models. The half-trek
algorithm was further generalized by Chen et al. (2014) to
identify more coefficients in under-identified models.

The  aforementioned  algorithms  of  Brito  (2004),  Foygel 
et  al.  (2012),  and  Chen  et  al.  (2014)  identify  coefficients  by 
searching  for  graphical  patterns  in  the  diagram  that  corre¬ 
spond  to  linear  independence  between  Wright’s  equations. 
Tian  (2005),  Tian  (2007),  and  Tian  (2009)  approach  the  prob¬ 
lem  differently  and  give  algorithms  that  identify  parameters 
by  converting  the  structural  equations  into  orthogonal  partial 
regression  equations. 

Finally, do-calculus (Pearl, 2000) and non-parametric
algorithms for identifying causal effects (Tian and Pearl, 2003;
Shpitser and Pearl, 2006; Huang and Valtorta, 2006) may
also be applied to parameter identification in linear models.
These methods have been shown to be complete for
non-parametric models (Shpitser and Pearl, 2006; Huang and
Valtorta, 2006) and, if theoretically possible, are able to identify
any expectations of the form $E(Y \mid do(X = x, Z = z))$, where Z
represents any subset of variables in the model other than X
and Y. As mentioned in the preliminaries, a coefficient from
X to Y equals $\frac{\partial}{\partial x} E(Y \mid do(X = x, S = s))$, where S represents
all variables in the model other than X and Y.

A  Simple  Criterion  for  Model  Identification 

In order to determine identifiability of the model using the
single-door criterion or instrumental variables, the modeler
must check the identifiability of each path coefficient. In
large and complex models, this process can be tedious. In
this section, we give a simple, sufficient criterion, called the
bow-free rule (Brito and Pearl, 2002b; Brito, 2004), that allows
the modeler to determine immediately whether a recursive
model is identified. We will see that even a model as
complicated as Figure 16 can be immediately determined to be
identified using this rule.


A bow-arc is a pair of variables, one of which is a direct
function of the other, whose error terms are correlated. This
is depicted in the path diagram as a parent-child pair that
are also siblings and looks like a bow-arc. In Figure 4b, the
variables X and Y create a bow-arc.

Theorem 6. (Brito and Pearl, 2002b) (Bow-free Rule) Every
recursive model whose path diagram lacks bow-arcs is
identified.

The bow-free rule is able to identify models that the single-door
criterion cannot. In Figure 16, for example, the coefficient a is
not identified using the single-door criterion. Attempting to
block the back-door path, $X_1 \leftrightarrow X_2 \rightarrow Y$,
by conditioning on $X_2$ opens the path
$X_1 \leftrightarrow Z_2 \leftrightarrow Y$ because $X_2$ is a
descendant of the collider, $Z_2$. However, because Figure 16 does
not contain any bow-arcs, it is identified according to Theorem 6.
Finally, since the single-door criterion is unable to identify any
model that contains bow-arcs, the bow-free rule subsumes the
single-door criterion when applied to model identification. (Note
that the single-door criterion may be able to identify some
coefficients even when the model as a whole is not identified. In
contrast, the bow-free rule only addresses the question of model
identifiability, not the identifiability of individual
coefficients in unidentified models.)
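The bow-arc check behind Theorem 6 is mechanical, and a minimal sketch may make it concrete. The function names and the two small example models below are hypothetical illustrations, not taken from the paper; the sketch only looks for bow-arcs and assumes the caller has already verified that the model is recursive.

```python
def find_bow_arcs(directed, bidirected):
    """Return the parent-child pairs whose error terms are also correlated.

    directed:   iterable of (parent, child) arrows
    bidirected: iterable of (node, node) pairs with correlated error terms
    """
    correlated = {frozenset(pair) for pair in bidirected}
    return [(p, c) for p, c in directed if frozenset((p, c)) in correlated]


def is_bow_free(directed, bidirected):
    """True if no parent-child pair also shares a bidirected edge.

    Assumption: the model is recursive (acyclic); this is not checked here.
    """
    return not find_bow_arcs(directed, bidirected)


# Hypothetical model 1: Z -> X -> Y with correlated errors on X and Y (a bow-arc).
bow = find_bow_arcs([("Z", "X"), ("X", "Y")], [("X", "Y")])

# Hypothetical model 2: same arrows, but the correlated errors are on Z and Y.
free = is_bow_free([("Z", "X"), ("X", "Y")], [("Z", "Y")])
```

In the second model Z and Y are not a parent-child pair, so the rule applies and the model is identified; in the first model the rule is silent, since it is only a sufficient condition.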

Total  Effects 

When the model is not identifiable, modelers typically consider
research with SEMs “impossible” (Kenny and Milan, 2011) without
imposing additional constraints or collecting additional data.
However, as should be clear from the single-door criterion (and is
acknowledged by Kenny and Milan (2011)), it is often possible to
identify some of the model coefficients even when the model as a
whole is not identifiable. Further, we show in this section that
it is often not necessary to identify all coefficients, or even
the coefficients along a causal path, in order to identify the
causal effect of interest. For example, in Figure 17a, the total
effect of X on Y, $\frac{\partial}{\partial x} E[Y \mid do(X = x)]$,
is identified and equal to $\beta_{YX}$ even though it is unclear
how to identify b, d, or e. The back-door criterion, given below,
is a sufficient condition for the identification of a total
effect.

Theorem 7. (Pearl, 2000) (Back-door Criterion) For any two
variables X and Y in a causal diagram G, the total effect of X on
Y is identifiable if there exists a set of measurements Z such
that

(i) no member of Z is a descendant of X; and

(ii) Z d-separates X from Y in the subgraph $G_{\underline{X}}$
formed by deleting from G all arrows emanating from X.

Moreover, if the two conditions are satisfied, then the total
effect of X on Y is given by $\beta_{YX.Z}$.

Footnote: Foygel et al. (2012) also released an R package
implementing their algorithm called SEMID, which determines
whether the entire model is identifiable given its causal graph.

Footnote: To prove this statement, consider any model that
contains a bow-arc from X to Y. There is no way to block the path
$X \leftrightarrow Y$ and identify the coefficient from X to Y
using the single-door criterion.

Footnote: This fact was noted by Marschak (1942) and was dubbed
“Marschak’s Maxim” by Heckman (2000).

Figure 17. (a), (b) Unidentified graphs for which the total effect
of X on Y is identified

Returning to the example in Figure 17a, we see that the total
effect of X on Y, $\frac{\partial}{\partial x} E[Y \mid do(X = x)]$,
is $\beta_{YX}$, while the total effect of X on Y in Figure 17b is
$\beta_{YX.Z}$.
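To see the back-door adjustment at work numerically, the sketch below uses a small hypothetical model that is not one of the paper's figures: Z = U_Z, X = aZ + U_X, Y = bX + cZ + U_Y, with unit-variance, mutually uncorrelated errors. Here {Z} satisfies both conditions of Theorem 7, so the partial regression coefficient of Y on X given Z should recover the total effect b, while the unadjusted coefficient should not. The coefficient values are arbitrary assumptions.

```python
def implied_covariances(a, b, c):
    """Covariances implied by Z = U_Z, X = a*Z + U_X, Y = b*X + c*Z + U_Y,
    assuming unit-variance, mutually uncorrelated error terms."""
    var_z = 1.0
    var_x = a * a + 1.0
    cov_xz = a
    cov_yz = b * cov_xz + c * var_z
    cov_yx = b * var_x + c * cov_xz
    return var_x, var_z, cov_xz, cov_yz, cov_yx


def beta_yx_given_z(var_x, var_z, cov_xz, cov_yz, cov_yx):
    """Coefficient of X in the population regression of Y on X and Z."""
    return (cov_yx * var_z - cov_yz * cov_xz) / (var_x * var_z - cov_xz ** 2)


a, b, c = 0.8, 0.5, 0.6  # hypothetical structural coefficients
var_x, var_z, cov_xz, cov_yz, cov_yx = implied_covariances(a, b, c)

unadjusted = cov_yx / var_x  # beta_YX, biased by the path X <- Z -> Y
adjusted = beta_yx_given_z(var_x, var_z, cov_xz, cov_yz, cov_yx)  # beta_YX.Z
```

With these values the adjusted coefficient coincides with b, whereas the unadjusted one absorbs the confounding contribution of Z.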

Do-calculus  (Pearl,  2000)  and  the  aforementioned  non- 
parametric  algorithms  (Tian  and  Pearl,  2003;  Shpitser  and 
Pearl,  2006;  Huang  and  Valtorta,  2006)  can  also  be  used  to 
identify  total  effects  in  linear  models. 

Model  Testing 

A crucial step of structural equation modeling is to test the
structural and causal assumptions of the model, ensuring to the
best of our ability that the model specification accurately
reflects the data generating mechanism. The most common method of
testing a linear SEM is a likelihood ratio or chi-square test that
compares the covariance matrix implied by the model to the sample
covariance matrix (Bollen, 1989; Shipley, 1997). While this test
simultaneously tests all of the restrictions implied by the model,
it relies critically on our ability to identify the model.
Moreover, bad fit does not provide the modeler with information
about which aspect of the model needs to be revised. Finally, if
the model is very large and complex, it is possible that a global
chi-square test will not reject the model even when a crucial
testable implication is violated. Global tests represent summaries
of the overall model-data fit and, as a result, violation of
specific testable implications may be masked (Tomarken and Waller,
2003). In contrast, if the testable implications are enumerated
and tested individually, the power of each test is greater than
that of a global test (Bollen and Pearl, 2013; McDonald, 2002),
and, in the case of failure, the researcher knows exactly which
constraint was violated. Path diagrams allow modelers to identify
vanishing partial correlations by inspection, provide a necessary
and sufficient condition for equivalence among recursive models
with uncorrelated error terms (often called Markovian models), and
permit us to predict new types of constraints, beyond the
vanishing correlation variety.

Figure 18. (a) Example illustrating vanishing partial correlation
(b) The skeleton of the model in (a)

Vanishing  Correlation  Constraints 

D-separation allows modelers to predict vanishing partial
correlations simply by inspecting the graph, and in the case of
Markovian models, these vanishing partial correlations represent
all of the constraints implied by the model (Pearl, 2000). For the
example depicted in Figure 18a, we obtain the following vanishing
partial correlations: $\rho_{V_2V_3.V_1} = 0$,
$\rho_{V_1V_4.V_2V_3} = 0$, $\rho_{V_2V_5.V_4} = 0$, and
$\rho_{V_3V_5.V_4} = 0$. If a constraint, say
$\rho_{V_2V_3.V_1} = 0$, does not hold in the dataset, we have
reason to believe that the model specification is incorrect and
should reconsider the lack of an edge between $V_2$ and $V_3$.
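Reading such constraints off a graph can be automated. The sketch below is one standard way to test d-separation in a DAG (restrict to the ancestral subgraph, moralize it, delete the conditioning set, and test connectivity); it is not code from the paper. The edge list is an assumption: Figure 18a is not reproduced here, so the graph used is simply a DAG consistent with the four constraints listed above.

```python
from itertools import combinations


def ancestors(parents, nodes):
    """Return `nodes` together with all of their ancestors."""
    seen, stack = set(nodes), list(nodes)
    while stack:
        for p in parents.get(stack.pop(), ()):
            if p not in seen:
                seen.add(p)
                stack.append(p)
    return seen


def d_separated(edges, x, y, z):
    """True if the set z d-separates x from y in the DAG given by `edges`."""
    parents = {}
    for a, b in edges:
        parents.setdefault(b, set()).add(a)
    keep = ancestors(parents, {x, y} | set(z))
    # Moralize the ancestral subgraph: connect each node to its parents
    # and "marry" every pair of parents that share a child.
    adj = {v: set() for v in keep}
    for child in keep:
        ps = parents.get(child, set())
        for p in ps:
            adj[child].add(p)
            adj[p].add(child)
        for p, q in combinations(ps, 2):
            adj[p].add(q)
            adj[q].add(p)
    # Delete z and check whether y is still reachable from x.
    blocked = set(z)
    seen, stack = {x}, [x]
    while stack:
        for n in adj[stack.pop()]:
            if n not in seen and n not in blocked:
                seen.add(n)
                stack.append(n)
    return y not in seen


# Assumed DAG, chosen to be consistent with the constraints in the text.
edges = [("V1", "V2"), ("V1", "V3"), ("V2", "V4"), ("V3", "V4"), ("V4", "V5")]
constraints = [
    ("V2", "V3", {"V1"}),
    ("V1", "V4", {"V2", "V3"}),
    ("V2", "V5", {"V4"}),
    ("V3", "V5", {"V4"}),
]
results = [d_separated(edges, x, y, z) for x, y, z in constraints]
```

Each triple that passes this test corresponds to a vanishing partial correlation that can be checked against the sample covariance matrix. Conditioning additionally on the collider V4 would reconnect V2 and V3, which the same function reports.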

In large and complex graphs, it may be infeasible to list all
conditional independence constraints by inspection. Additionally,
some constraints obtained using d-separation may be redundant.
Kang and Tian (2009) gave an algorithm that utilizes the graph to
enumerate a set (not necessarily minimal) of vanishing partial
correlations that imply all others for recursive models with or
without correlated error terms.

Lastly, we note that d-separation implies vanishing partial
correlation even in non-linear models.


Equivalent  Models 

Since vanishing partial correlations represent all of the
constraints that Markovian SEMs impose on the data, two models are
observationally indistinguishable if they share the same set of
vanishing partial correlations. In other words, Markovian models
that share the same set of vanishing partial correlations cannot
be distinguished using data. In this case, we say that the models
are covariance equivalent since every covariance matrix generated
by one model (through some choice of parameters) can also be
generated by the other. The skeleton of a graph, used in the
following theorem, is the undirected graph obtained by replacing
all arrows with undirected edges. For example, the skeleton for
Figure 18a is Figure 18b.

Theorem 8. (Verma and Pearl, 1990) Two Markovian linear-normal
models are covariance equivalent if and only if they entail the
same sets of zero partial correlations. Moreover, two such models
are covariance equivalent if and only if their corresponding
graphs have the same skeletons and the same sets of v-structures,
that is, two converging arrows whose tails are not connected by an
arrow.

The first part of Theorem 8 defines the testable implications of
linear Markovian models. It states that, in non-experimental
studies, Markovian SEMs cannot be tested for any feature other
than those vanishing partial correlations that the d-separation
test imposes. The second part provides a simple test for
equivalence that requires merely a comparison of corresponding
edges and their directionalities (Pearl, 2000).

The graphs in Figures 19b, 19c, and 19d are equivalent because
they are all compatible with the graph in Figure 19a, which
displays the skeleton and v-structures. Note that we cannot
reverse the edge from $V_4$ to $V_5$ since doing so would generate
a new v-structure, $V_2 \rightarrow V_4 \leftarrow V_5$.
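The graphical test in Theorem 8 amounts to comparing two sets. A minimal sketch, assuming Markovian models given as lists of directed edges, is shown below. The function names are hypothetical, and because Figure 19 is not reproduced here the five-node graph is an assumed example consistent with the remark about the edge from V4 to V5.

```python
from itertools import combinations


def skeleton(edges):
    """Undirected version of the graph: each arrow becomes an unordered pair."""
    return {frozenset(e) for e in edges}


def v_structures(edges):
    """All triples (p, child, q) with p -> child <- q and p, q non-adjacent."""
    skel = skeleton(edges)
    parents = {}
    for a, b in edges:
        parents.setdefault(b, set()).add(a)
    found = set()
    for child, ps in parents.items():
        for p, q in combinations(sorted(ps), 2):
            if frozenset((p, q)) not in skel:
                found.add((p, child, q))
    return found


def covariance_equivalent(edges1, edges2):
    """Theorem 8 test: same skeleton and same v-structures."""
    return (skeleton(edges1) == skeleton(edges2)
            and v_structures(edges1) == v_structures(edges2))


# Assumed example graph (not necessarily the paper's Figure 19).
g = [("V1", "V2"), ("V1", "V3"), ("V2", "V4"), ("V3", "V4"), ("V4", "V5")]
# Reversing V1 -> V2 creates no new v-structure.
g_rev12 = [("V2", "V1"), ("V1", "V3"), ("V2", "V4"), ("V3", "V4"), ("V4", "V5")]
# Reversing V4 -> V5 creates V2 -> V4 <- V5 and V3 -> V4 <- V5.
g_rev45 = [("V1", "V2"), ("V1", "V3"), ("V2", "V4"), ("V3", "V4"), ("V5", "V4")]

same = covariance_equivalent(g, g_rev12)
different = covariance_equivalent(g, g_rev45)
```

The sketch does not check that either edge list is acyclic; it only applies the skeleton and v-structure comparison to graphs the caller asserts are recursive.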

The graphical criterion given in Theorem 8 is necessary and
sufficient for equivalence between Markovian models. It is a
necessary condition for equivalence between non-recursive models
and models with correlated error terms since d-separation in the
graph implies vanishing partial correlation in the covariance
matrix. (Models that are either non-recursive or have correlated
error terms are called non-Markovian. Non-Markovian models that
are recursive are called semi-Markovian.) In contrast, the more
prevalent replacement criterion (Lee and Hershberger, 1990) is not
always valid. Pearl (2012) gave the counterexample depicted in
Figure 20. According to the replacement criterion, we can replace
the arrow $X \rightarrow Y$ with a bidirected edge
$X \leftrightarrow Y$ and obtain a covariance equivalent model
when all predictors (Z) of the effect variable (Y) are the same as
those for the source variable (X). Unfortunately, the
post-replacement model imposes the constraint $\rho_{WZ.Y} = 0$,
which is not imposed by the original model. This can be seen from
the fact that, conditioned on Y, the path
$Z \rightarrow Y \leftarrow X \leftrightarrow W$ is unblocked and
becomes blocked if replaced by
$Z \rightarrow Y \leftrightarrow X \leftrightarrow W$. The same
applies to the path $Z \rightarrow X \leftrightarrow W$, since Y
would cease to be a descendant of X.

Footnote: The replacement rule violates the transitivity of
equivalence (Hershberger, 2006), yet it is still used in most of
the SEM literature (Mulaik, 2009; Williams, 2011, pp. 247-260).

Figure 19. Models (b), (c), and (d) are equivalent. The arrows in
(a) cannot be reversed.

Figure 20. Counterexample to the standard Replacement Rule; the
arrow $X \rightarrow Y$ cannot be replaced.

Figure 21. A graph illustrating a Verma constraint

Testable  Implications  in  Semi-Markovian  Models 

In the case of non-Markovian models, additional testable
implications may be present, which are not revealed by
d-separation. First noted by Verma and Pearl (1990), these
constraints, often called Verma constraints in the non-parametric
literature, impose invariance rather than conditional independence
restrictions. Algorithms that enumerate certain types of Verma
constraints for semi-Markovian, non-parametric SEMs are given by
Tian and Pearl (2002) and Shpitser and Pearl (2008).

In Figure 21, for example, one can show that regardless of the
structural equations, the quantity
$\sum_{V_2} P(V_4 \mid V_3, V_2, V_1) P(V_2 \mid V_1)$ is not a
function of $V_1$.

Testable implications can also be obtained by overidentifying
model parameters. In some cases, these constraints will be
conditional independence constraints while in others they are not.
The Verma constraint shown in Figure 21 is obtainable using
overidentification. Using the single-door criterion (Pearl, 2000)
and Wright’s path-tracing rules, we identify two expressions for c
in terms of the covariance matrix:

$$c = \beta_{43.2} = \frac{\sigma_{43} - \sigma_{42}\sigma_{32}}{1 - \sigma_{32}^2} \quad \text{and} \quad c = \frac{abc}{ab} = \frac{\sigma_{41}}{\sigma_{31}}.$$

Equating the two expressions, and using
$\sigma_{31} = ab = \sigma_{21}\sigma_{32}$, we obtain the
following constraint:

$$\frac{\sigma_{43} - \sigma_{42}\sigma_{32}}{1 - \sigma_{32}^2} = \frac{\sigma_{41}}{\sigma_{21}\sigma_{32}}$$

This constraint is equivalent to the one obtained using the
non-parametric methods of Tian and Pearl (2002) and Shpitser and
Pearl (2008). An algorithm for systematically discovering
constraints by overidentifying coefficients using the causal graph
is given by Chen et al. (2014). As yet, testable implications in
semi-Markovian and non-Markovian models have not been fully
characterized, and consequently, we do not have a necessary and
sufficient condition for equivalence between semi-Markovian or
non-Markovian models.
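The overidentifying constraint for Figure 21, which equates $\beta_{43.2}$ with $\sigma_{41}/(\sigma_{21}\sigma_{32})$, can be checked numerically. The sketch below assumes a standardized chain V1 -> V2 -> V3 -> V4 with coefficients a, b, c and a single correlated error pair between V2 and V4 with covariance d. This structure is an assumption based on the standard form of the Verma example, since the figure itself is not reproduced here, and the parameter values are arbitrary.

```python
def implied_correlations(a, b, c, d):
    """Correlations implied by the assumed standardized model
    V2 = a*V1 + U2, V3 = b*V2 + U3, V4 = c*V3 + U4, with cov(U2, U4) = d
    and all other error terms uncorrelated."""
    s21 = a
    s32 = b
    s31 = a * b
    s41 = a * b * c    # the only unblocked path from V1 to V4 is the chain
    s42 = b * c + d    # chain V2 -> V3 -> V4 plus the correlated errors
    s43 = c + b * d    # direct effect plus V3 <- V2 <-> V4
    return s21, s31, s32, s41, s42, s43


a, b, c, d = 0.5, 0.6, 0.4, 0.3  # hypothetical parameter values
s21, s31, s32, s41, s42, s43 = implied_correlations(a, b, c, d)

lhs = (s43 - s42 * s32) / (1 - s32 ** 2)  # beta_{43.2}
rhs = s41 / (s21 * s32)                   # sigma_41 / (sigma_21 * sigma_32)
```

Both sides reduce to c for any admissible parameter values, so a sample covariance matrix in which they differ substantially is evidence against the model.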


Learning  Structure  from  Data 

The question naturally arises whether one can learn the structure
of the data generating model from its data. In other words, rather
than specify the structural equation model and use the data to
test it, can one use the data to discover aspects of the model’s
structure? There are a number of algorithms that search the data
for vanishing partial correlations to accomplish this goal for
recursive models. See (Cooper, 1999), (Pearl, 2000, ch. 2), and
(Spirtes et al., 2000, chs. 5 and 6) for examples. For
non-recursive models, Hoover and Phiromswad (2013) make use of
overidentifying constraints resulting from both multiple
instruments and vanishing partial correlations to uncover aspects
of the model’s structure.

Additional  Applications  of  Graphical  Models 
Equivalent  Regressor  Sets  and  Minimal  Regressor  Sets 

In some cases, we may wish to know whether two sets, when used for
adjustment, have the same asymptotic bias. For example, an
investigator may wish to assess, prior to taking any measurement,
whether two candidate sets of covariates, differing substantially
in dimensionality, measurement error, cost, or sample variability,
are equally valuable in their bias-reduction potential (Pearl and
Paz, 2010). This problem pertains to prediction tasks as well. A
researcher wishing to predict the value of some variable given a
set of observations may wonder whether another set of observations
is a valid substitute.

In the linear case, the problem can be stated in the following
way. Under what conditions would replacing
$Z = \{Z_1, \ldots, Z_n\}$ with $W = \{W_1, \ldots, W_n\}$ yield
the same value for $\alpha$ in the regression
$Y = \alpha X + \beta_1 Z_1 + \ldots + \beta_n Z_n + \epsilon$,
or equivalently, when does $\beta_{YX.Z} = \beta_{YX.W}$?

Here we adapt Theorem 3 in (Pearl and Paz, 2010) for linear SEMs.

Theorem 9. $\beta_{YX.Z} = \beta_{YX.W}$ if one of the following
holds:

(i) Z and W satisfy the back-door criterion for the total effect
of X on Y,

(ii) $Z \cap W$ separates X from all other elements of Z and W.

If $\beta_{YX.Z} = \beta_{YX.W}$ then we say that Z and W are
confounding equivalent, or c-equivalent for short.

Footnote: Parameters are often described as overidentified when
they have “more than one solution” (MacCallum, 1995) or are
“determined from [the covariance matrix] in different ways”
(Joreskog et al., 1979). However, expressing a parameter in terms
of the covariance matrix in more than one way does not necessarily
mean that equating the two expressions actually constrains the
covariance matrix. See (Pearl, 2004) for a formal definition of
parameter overidentification.

Footnote: Software implementing these algorithms is available from
the TETRAD Project (http://www.phil.cmu.edu/projects/tetrad/).

Figure 22. $\{V_1, W_2\}$ and $\{V_2, W_1\}$ are c-equivalent but
not $\{W_1\}$ and $\{W_2\}$

Figure 23. Graph illustrating preference to $Z_2$ over $Z_1$;
$Var[\beta_{YX.WZ_2}] \leq Var[\beta_{YX.WZ_1}]$


Consider the graph depicted in Figure 22. Let $Z = \{V_1, W_2\}$
and $W = \{W_1, V_2\}$. Since both Z and W satisfy the back-door
criterion, they are c-equivalent and
$\beta_{YX.Z} = \beta_{YX.W}$. Now consider $Z = \{V_1\}$ and
$W = \{V_1, V_2\}$. Z and W no longer satisfy the back-door
criterion. However, since $Z \cap W = \{V_1\}$ separates X from
$(Z \cup W) \setminus (Z \cap W) = \{V_2\}$, Z and W are
c-equivalent and $\beta_{YX.Z} = \beta_{YX.W}$.

C-equivalence can also be used to find a minimal subset of
regressors needed for estimating a given partial regression
coefficient. Consider a regression equation,
$Y = \alpha X + \beta_1 Z_1 + \ldots + \beta_n Z_n$. What is the
smallest subset of $Z = \{Z_1, \ldots, Z_n\}$ that yields the same
value for the regression coefficient, $\alpha$? This subset is
unique and can be found simply by removing elements from Z one at
a time such that every removed node is d-separated from X given
the remaining elements of Z.

Variance  Minimization 

In some cases, there may be multiple sets that satisfy the
back-door criterion when identifying a total effect. While each
set provides an unbiased estimate of the causal effect, the
estimates may differ in their asymptotic variance. As a result,
some sets may be preferable to others. The following theorem is
adapted from Theorem 5 of (Kuroki and Miyakawa, 2003):

Theorem 10. Suppose that sets $\{W, Z_1\}$ and $\{W, Z_2\}$
satisfy the back-door criterion relative to (X, Y) in a
linear-normal SEM. If $\{W, Z_1\}$ d-separates X from $Z_2$ and
$\{X, Z_2, W\}$ d-separates Y from $Z_1$, then
$Var[\beta_{YX.WZ_2}] \leq Var[\beta_{YX.WZ_1}]$. In other words,
the asymptotic variance of the effect estimated when controlling
for $\{W, Z_2\}$ is less than or equal to the one estimated by
controlling for $\{W, Z_1\}$.

For the model depicted by Figure 23, both $\{W, Z_1\}$ and
$\{W, Z_2\}$ are back-door admissible sets for estimating the
total effect of X on Y. However, $\{W, Z_2\}$ is preferable since
$\{W, Z_1\}$ d-separates X from $Z_2$ while $\{X, Z_2, W\}$
d-separates Y from $Z_1$. The intuition here is that $Z_2$ is
‘closer’ to Y, hence more effective in reducing variations in Y
due to uncontrolled factors. Similar results were derived without
graphs by Hahn (2004).


Counterfactuals  in  Linear  Models 

We have seen in the subsection on causal effects how a SEM can be
used to predict the effect of actions and policies that have never
been implemented before. The action of setting a variable, X, to
value x is simulated by replacing the structural equation for X
with the equation X = x. In this section, we show further that
SEMs can be used to answer counterfactual queries. A
counterfactual query asks, “Given that we observe E = e for a
given individual, what would we expect the value of B for that
individual to be if A had been a?” For example, given that Joe’s
salary is s, what would his salary be had he had five more years
of education? This expectation is denoted
$E[B_{A=a} \mid E = e]$. The E = e after the conditioning bar
represents the observed evidence while the subscript A = a
represents a hypothetical condition specified by the
counterfactual sentence. Structural equation models are able to
answer counterfactual queries because each equation represents an
invariant mechanism by which a variable obtains its values. If we
identify these mechanisms we should also be able to predict what
values would be obtained had circumstances been different.

The following model, depicted in Figure 24a, represents an
“encouragement design” (Holland, 1988; Pearl, 2014b) where X
represents the amount of time a student spends in an after-school
remedial program, H the amount of homework a student does, and Y a
student’s score on the exam. The value of each variable is given
as the number of standard deviations above the mean so that the
model is standardized to mean 0 and variance 1. For example, if
Y = 1 then the student scored 1 standard deviation above the mean
on his or her exam.

Model 4.

$$X = U_X$$
$$H = aX + U_H$$
$$Y = bX + cH + U_Y$$
$$\sigma_{U_i U_j} = 0 \text{ for all } i \neq j, \; i, j \in \{X, H, Y\}$$

We also give the values for the coefficients (which can be
estimated from population data):

a = 0.5
b = 0.7
c = 0.4

Figure 24. Answering counterfactual question by setting H equal to
2. Panels (a) and (b) show X (Encouragement), H (Homework), and Y
(Exam Score).

Let us consider a student named Joe, for whom we measure X = 0.5,
H = 1, Y = 1.5. Suppose we wish to answer the following query:
What would Joe’s score have been had he doubled his study time?

In a linear SEM, the value of each variable in the model is
determined by the coefficients and U variables, and the latter
account for all variations among individuals. As a result, we can
use the evidence X = 0.5, H = 1, Y = 1.5 to determine the values
of the U variables associated with Joe. These values are invariant
to external variations, such as those which might cause Joe to
double his homework.

In this case, we are able to obtain the specific characteristics
of Joe from the evidence:

$$U_X = 0.5,$$
$$U_H = 1 - 0.5 \cdot 0.5 = 0.75, \text{ and}$$
$$U_Y = 1.5 - 0.7 \cdot 0.5 - 0.4 \cdot 1 = 0.75.$$

Next, we simulate the action of doubling Joe’s study time by
replacing the structural equation for H with the constant H = 2.
The modified model is depicted in Figure 24b. Finally, we compute
the value of Y in our modified model using the updated U values,
giving:

$$Y_{H=2}(U_X = 0.5, U_H = 0.75, U_Y = 0.75) = 0.5 \cdot 0.7 + 2.0 \cdot 0.4 + 0.75 = 1.90$$

We thus conclude that Joe’s new score, predicated on doubling his
homework, would have been 1.9 instead of 1.5.

In summary, we first applied the evidence X = 0.5, H = 1, Y = 1.5
to update the values for the U variables or their probabilities.
We then simulated an external intervention to force the condition
H = 2 by replacing the structural equation $H = aX + U_H$ with the
equation H = 2. Finally, we computed the value of Y given the
structural equations and the updated U values.

The following three steps generalize the above procedure for
non-linear systems and arbitrary counterfactuals of the form
$E[B_{A=a} \mid E = e]$ (Pearl, 2000):

(i) Abduction - Update $P[U]$ by the evidence to obtain
$P[U \mid E = e]$.

(ii) Action - Modify the model, M, by removing the structural
equations for the variables in A and replacing them with the
appropriate equalities to obtain the modified model, $M_a$.

(iii) Prediction - Use the modified model, $M_a$, and the updated
probabilities over the U variables, $P[U \mid E = e]$, to compute
the expectation of B, the consequence of the counterfactual.
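The abduction, action, and prediction steps for Joe can be sketched in a few lines. The function names below are illustrative assumptions; the coefficients and evidence are the values given in the text for Model 4.

```python
def abduction(a, b, c, x, h, y):
    """Step (i): recover Joe's error terms from the evidence X=x, H=h, Y=y."""
    u_x = x
    u_h = h - a * x
    u_y = y - b * x - c * h
    return u_x, u_h, u_y


def predict_y_with_h_set(b, c, u_x, u_y, h_new):
    """Steps (ii) and (iii): replace the equation for H with H = h_new,
    then recompute Y from the unchanged equations and Joe's error terms."""
    x = u_x  # X = U_X is unaffected by intervening on H
    return b * x + c * h_new + u_y


a, b, c = 0.5, 0.7, 0.4  # coefficients of Model 4
u_x, u_h, u_y = abduction(a, b, c, x=0.5, h=1.0, y=1.5)
y_counterfactual = predict_y_with_h_set(b, c, u_x, u_y, h_new=2.0)
```

Note that U_H is recovered in the abduction step but plays no role in the prediction, because the equation it enters is the one removed by the intervention.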


Notice that the above procedure applies not only to retrospective
counterfactual queries (queries of the form “What would have been
the value of Y had X been a?”) but also to prospective
counterfactual queries (queries of the form “What will the value
of Y be if X is set to a by intervention?”). For example, suppose
we wish to estimate the effect on test score provided by a school
policy that sends students who are lazy on their homework
(S < -1) to attend the after-school program for X = 1. The
expected value of this quantity is denoted
$E[Y_{X=1} \mid S < -1]$ and can, in principle, be computed using
the above three-step method. Counterfactual reasoning and the
above procedure are necessary for estimating the effect of actions
and policies on subsets of the population characterized by
features that, in themselves, are policy dependent (e.g. S < -1).

In non-parametric models, counterfactual quantities of the form
$E[B_{A=a} \mid E = e]$ may not be identifiable, even if we have
the luxury of running experiments (Pearl, 2009). In linear models,
however, any counterfactual of the form
$E[Y_{X=a} \mid E = e]$, with e an arbitrary evidence, is
identified whenever $E[Y \mid do(X = a)]$ is identified (Pearl,
2009, p. 389). As a result, if the data generating mechanism is
linear, any counterfactual quantity is identifiable whenever the
model parameters are identifiable.

Footnote: Any expectation of the form $E[Y \mid do(X = a)]$ can,
of course, be identified experimentally by randomizing the value
of X and computing the average of Y over the population for which
X = a.

Figure 25. Graph corresponding to Model 5 in text

Theorem 11. (Pearl, 2009) Let T be the slope of the total effect
of X on Y, $T = \frac{\partial}{\partial x} E[Y \mid do(X = x)]$;
then $E[Y_{X=x} \mid E = e] = E[Y \mid E = e] + T(x - E[X \mid E = e])$.

This provides an intuitive interpretation of counterfactuals in
linear models: $E[Y_{X=x} \mid E = e]$ can be computed by first
calculating the best estimate of Y conditioned on the evidence e,
$E[Y \mid e]$, and then adding to it whatever change is expected
in Y when X is shifted from its current best estimate,
$E[X \mid E = e]$, to its hypothetical value, x.
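As a consistency check, the formula in Theorem 11 can be applied to Joe's case from Model 4, taking H as the intervened variable. This is a worked illustration rather than part of the original derivation: it assumes the evidence e is {X = 0.5, H = 1, Y = 1.5}, so that $E[Y \mid e] = 1.5$ and $E[H \mid e] = 1$, and that the total effect of H on Y is T = c = 0.4.

```latex
\begin{aligned}
E[Y_{H=2} \mid e] &= E[Y \mid e] + T\,\bigl(2 - E[H \mid e]\bigr) \\
                  &= 1.5 + 0.4\,(2 - 1) \\
                  &= 1.9
\end{aligned}
```

This agrees with the value obtained earlier by the three-step procedure.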

Methodologically, the importance of Theorem 11 lies in enabling
researchers to answer hypothetical questions about individuals (or
sets of individuals) from population data. The ramifications of
this feature in legal contexts and political science are explored,
respectively, in (Pearl, 2009, ch. 9) and Yamamoto (2012).


Example  Problems 

In this section, we apply graphical tools to solve non-trivial
problems that SEM researchers are likely to encounter.

Model 5.

$$Y = aW_3 + bZ_3 + cW_2 + U$$
$$X = t_1 W_1 + t_2 Z_3 + U'$$
$$W_3 = c_3 X + U_3'$$
$$W_1 = a_1' Z_1 + U_1'$$
$$Z_3 = a_3 Z_1 + b_3 Z_2 + U_3$$
$$W_2 = c_2 Z_2 + U_2'$$
$$Z_1 = U_1$$
$$Z_2 = U_2$$

Given the model depicted above, we pose the following questions:

(i) Identify three testable implications of this model.

(ii) Identify a testable implication assuming that only X, Y,
$W_3$, and $Z_3$ are observed.

(iii) Suppose X, Y, and $W_3$ are the only variables observed.
Which parameters can be identified from the data?

(iv) If we regress $Z_1$ on all other variables in the model,
which regression coefficients will be zero?

(v) The model in Figure 25 implies that certain regression
coefficients will remain invariant when an additional variable is
added as a regressor. Identify five such coefficients with their
added regressors.


Solutions:

(i) Figure 25 shows that $\{W_1, Z_3, W_2, W_3\}$ blocks all paths
between X and Y. Therefore, $\sigma_{XY.W_1Z_3W_2W_3} = 0$.
Likewise, $\{W_1, Z_3\}$ blocks all paths between X and $Z_1$ and
$\{Z_3, W_2\}$ blocks all paths between Y and $Z_2$. As a result,
$\sigma_{XZ_1.W_1Z_3} = 0$ and $\sigma_{YZ_2.Z_3W_2} = 0$.

(ii) When only X, Y, $W_3$, and $Z_3$ are observed (that is, when
$Z_1$, $W_1$, $Z_2$, and $W_2$ are latent variables), Model 5 is
equivalent to the graph in Figure 26. We see that $W_3$ is
d-separated from $Z_3$ by X. Therefore, $\sigma_{W_3Z_3.X} = 0$.

(iii) $c_3$ is identified using the single-door criterion. When we
remove the edge $X \rightarrow W_3$, X is d-separated from $W_3$.
Likewise, a can be identified using the single-door criterion.
When we remove the edge $W_3 \rightarrow Y$, $W_3$ is d-separated
from Y by X. Therefore, $c_3 = \beta_{W_3X}$ and
$a = \beta_{YW_3.X}$.

(iv) The coefficients for X, $W_3$, $W_2$, and Y will be zero
since they are d-separated from $Z_1$ by $\{W_1, Z_3, Z_2\}$. The
coefficient for $Z_2$ may not be zero since $Z_3$ is a collider.

(v) (a) $\beta_{YX.W_1Z_3} = \beta_{YX.W_1Z_3Z_1}$ since both
$\{W_1, Z_3\}$ and $\{W_1, Z_3, Z_1\}$ satisfy the back-door
criterion for the total effect of X on Y.

(b) $\beta_{YW_3.X} = \beta_{YW_3.XW_1}$ since $\{X\}$ and
$\{X, W_1\}$ satisfy the back-door criterion for the total effect
of $W_3$ on Y.

(c) $\beta_{Z_2Z_1} = \beta_{Z_2Z_1.W_1}$ since $Z_2$ is
d-separated from $Z_1$ by $\emptyset$ and by $W_1$. As a result,
both regression coefficients vanish.

(d) $\beta_{YW_2.Z_2} = \beta_{YW_2.Z_2Z_3Z_1}$ since both
$\{Z_2\}$ and $\{Z_2, Z_3, Z_1\}$ satisfy the back-door criterion
for the total effect of $W_2$ on Y.

(e) $\beta_{W_1Z_1} = \beta_{W_1Z_1.Z_3}$ since both $\emptyset$
and $\{Z_3\}$ satisfy the back-door criterion for the total effect
of $Z_1$ on $W_1$.

Figure 26. Graph representing Model 5 when $Z_1$, $W_1$, $Z_2$,
and $W_2$ are unobserved


Conclusion 

The benefits of graphs are usually attributed to their ability to
represent theoretical assumptions visibly and transparently, by
abstracting away unnecessary algebraic details. What is not
generally recognized is graphs’ ability to serve as efficient
computational engines. This paper demonstrates how graphs can
compute the testable implications of modeling assumptions and how
they can combine those assumptions with data and generate
quantitative answers to both statistical and causal questions
about populations and individuals.

We showed that a few basic principles of reading vanishing partial
correlations from graphs can give rise to new methods of model
testing and identification that far exceed traditional methods of
SEM. The construction of equivalent models and characterization of
instrumental variables follow directly from these principles.
Auxiliary techniques of counterfactual analysis further permit
researchers to quantify individual behavior from population data
and to reason backward into alternative courses of action.

Graphical representations have become an indispensable second
language in the health sciences (Glymour and Greenland, 2008;
Lange et al., 2012) and are making their way towards the social
and behavioral sciences (Chalak and White, 2011; Lee, 2012; Morgan
and Winship, 2007). We hope that the potential of these tools will
be recognized by SEM researchers as well.